Abstract. In this note, we propose the use of sparse methods (e.g. LASSO, Post-LASSO,
√LASSO, and Post-√LASSO) to form first-stage predictions and estimate optimal instruments
in linear instrumental variables (IV) models with many instruments in the canonical
Gaussian case. The methods apply even when the number of instruments is much larger than
the sample size. We derive asymptotic distributions for the resulting IV estimators and provide
conditions under which these sparsity-based IV estimators are asymptotically oracle-efficient.
In simulation experiments, a sparsity-based IV estimator with a data-driven penalty performs
well compared to recently advocated many-instrument-robust procedures. We illustrate the
procedure in an empirical example using the Angrist and Krueger (1991) schooling data.



1. Introduction 



Instrumental variables (IV) methods are widely used in applied statistics, econometrics, 
and more generally for estimating treatment effects in situations where the treatment status 
is not randomly assigned; see, for example, [5], [7], [21], [23], [30] among many
others. Identification of the causal effects of interest in this setting may be achieved through 
the use of observed instrumental variables that are relevant in determining the treatment 
status but are otherwise unrelated to the outcome of interest. In some situations, many such 
instrumental variables are available, and the researcher is left with the question of which set 
of the instruments to use in constructing the IV estimator. We consider one such approach to 
answering this question based on sparse-estimation methods in a simple Gaussian setting. 
Throughout the paper we consider the Gaussian simultaneous equation model

$$y_{1i} = d_i'\alpha_0 + \epsilon_i, \tag{1.1}$$

$$y_{2i} = D(x_i) + v_i, \tag{1.2}$$

$$(\epsilon_i, v_i)' \sim N\left(0, \begin{pmatrix} \sigma_\epsilon^2 & \sigma_{\epsilon v} \\ \sigma_{\epsilon v} & \sigma_v^2 \end{pmatrix}\right), \tag{1.3}$$

where $y_{1i}$ is the response variable, $y_{2i}$ is the endogenous variable, $w_i$ is a $k_w$-vector of
control variables, $x_i = (z_i', w_i')'$ is a vector of instrumental variables (IV), and
$(\epsilon_i, v_i)$ are disturbances that are independent of $x_i$. The function
$D(x_i) = \mathrm{E}[y_{2i} \mid x_i]$ is an unknown, potentially complicated function of the
instruments. Given a sample $(y_{1i}, y_{2i}, x_i)$, $i = 1, \ldots, n$, from the model above, the
problem is to construct an IV estimator for $\alpha_0 = (\alpha_1, \alpha_2')'$ that enjoys good
finite-sample properties and is asymptotically efficient.

We consider the case of fixed design, namely we treat the covariate values $x_1, \ldots, x_n$ as
fixed. This includes random sampling as a special case; indeed, in this case $x_1, \ldots, x_n$
represent a realization of this sample on which we condition throughout. Note that for
convenience, the notation has been collected in Appendix A.

First note that an asymptotically efficient, but infeasible, IV estimator for this model takes
the form

$$\alpha_I = \mathbb{E}_n[A_i d_i']^{-1}\,\mathbb{E}_n[A_i y_{1i}], \quad A_i = (D(x_i), w_i')', \quad d_i = (y_{2i}, w_i')'.$$

Under suitable conditions,

$$(\sigma_\epsilon^2 Q_n^{-1})^{-1/2}\sqrt{n}\,(\alpha_I - \alpha_0) =_d N(0, I) + o_P(1),$$

where $Q_n = \mathbb{E}_n[A_i A_i']$.

We would like to construct an IV estimator that is as efficient as the infeasible optimal IV
estimator $\alpha_I$. However, the optimal instrument $D(x_i)$ is an unknown function in practice
and has to be estimated. Thus, we investigate estimating the optimal instruments $D(x_i)$ using
sparse estimators arising from $\ell_1$-regularization procedures such as LASSO, post-LASSO, and
others; see [14], [13]. Such procedures are highly effective for estimating conditional
expectations, both computationally and theoretically,² and, as we shall argue, are also
effective for estimating optimal instruments.

¹In a companion paper, [11], we consider the important generalization to heteroscedastic,
non-Gaussian disturbances. Focusing on the canonical Gaussian case allows for an elementary
derivation of results, considerably sharper conditions, and much more refined penalty selection.
Therefore, the results for this canonical Gaussian case are of interest in their own right.

In order to approximate the optimal instrument $D(x_i)$, we consider a large list of technical
instruments,

$$f_i := (f_{i1}, \ldots, f_{ip})' := (f_1(x_i), \ldots, f_p(x_i))', \tag{1.4}$$

where the number of instruments $p$ is possibly much larger than the sample size $n$.
High-dimensional instruments $f_i$ could arise easily because

(i) the list of available instruments is large, in which case $f_i = x_i$, or

(ii) $f_i$ consists of a large number of series terms with respect to some elementary regressor
vector $x_i$, e.g., B-splines, dummies, and/or polynomials, along with various interactions.

Without loss of generality we normalize the regressors so that $\mathbb{E}_n[f_{ij}^2] = 1$ for
$j = 1, \ldots, p$.

The key condition that allows effective use of this large set of instruments is approximate
sparsity, which requires that most of the information in the optimal instrument can be captured
by a relatively small number of technical instruments. Formally, approximate sparsity can be
represented by the expansion of $D(x_i)$ as

$$D(x_i) = f_i'\beta_0 + a(x_i), \quad \sqrt{\mathbb{E}_n[a(x_i)^2]} \le c_s \lesssim \sigma_v\sqrt{s/n}, \quad \|\beta_0\|_0 = s = o(n), \tag{1.5}$$

where the main part $f_i'\beta_0$ of the optimal instrument uses only $s \ll n$ instruments, and the
remainder term $a(x_i)$ is approximation error that vanishes as the sample size increases.



The approximately sparse model (1.5) substantially generalizes the classical parametric
model of optimal instruments of [3] by letting the identities of the relevant instruments

$$T = \mathrm{support}(\beta_0) = \{j \in \{1, \ldots, p\} : |\beta_{0j}| > 0\}$$

be unknown and by allowing for approximation error in the parametric model for $D(x_i)$.
This generalization is useful in practice since we do not know the identities of the relevant
instruments in many examples. The model (1.5) also generalizes the nonparametric model
of optimal instruments of [24] by letting the identities of the most important series terms,
$T = \mathrm{support}(\beta_0)$, be unknown. In this case, the number $s$ is defined so that the
approximation error is of the same order as the estimation error, $\sqrt{s/n}$, of the oracle
estimator. This rate generalizes the rate for the optimal number $s$ of series terms in [24] by
not relying on knowledge of which $s$ series terms to include. Knowing the identities of the most
important series terms is unrealistic in many examples in practice. Indeed, the most important
series terms need not be the first $s$ terms, and the optimal number of series terms to consider
is also unknown. Moreover, an optimal series approximation to the instrument could come from the
combination of completely different bases, e.g. by using both polynomials and B-splines.

²Several $\ell_1$-regularized problems can be cast as convex programming problems and thus avoid
the computational curse of dimensionality that would arise from a combinatorial search over models.

Based on the technical instruments $f_1, \ldots, f_p$ and a sparse method such as LASSO or
post-LASSO, we obtain estimates of $D(x_i)$ of the form

$$\hat D(x_i) = f_i'\hat\beta. \tag{1.6}$$

Sparse methods take advantage of the approximate sparsity and ensure that many elements
of $\hat\beta$ are zero when $p$ is large. In other words, sparse methods will select a small subset
of the available instruments. We then set

$$\hat A_i = (\hat D(x_i), w_i')' \tag{1.7}$$

to form the IV estimator

$$\hat\alpha^* = \mathbb{E}_n[\hat A_i d_i']^{-1}\,\mathbb{E}_n[\hat A_i y_{1i}]. \tag{1.8}$$

The main result of this note is to show that sparsity-based methods can produce estimates
of the optimal instruments $D_i$ based on a small, data-dependent set of instruments such that

$$(\sigma_\epsilon^2 Q_n^{-1})^{-1/2}\sqrt{n}\,(\hat\alpha^* - \alpha_0) \to_d N(0, I) \tag{1.9}$$

under suitable regularity conditions. That is, the IV estimator based on estimating the
first stage with appropriate sparse methods is asymptotically as efficient as the infeasible
optimal IV estimator that uses $D(x_i)$, and thus achieves the semi-parametric efficiency bound.
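
As a minimal sketch of the estimator in (1.8), the snippet below solves the sample moment equation for a given instrument vector. The function name `iv_estimator`, the toy design (one instrument, an intercept as the only control, error correlation 0.6), and the linear structural equation $y_{1i} = d_i'\alpha_0 + \epsilon_i$ are illustrative assumptions; the true $D(x_i)$ is plugged in here, whereas in practice it would be replaced by a sparse first-stage fit $\hat D(x_i)$.

```python
import numpy as np

def iv_estimator(A, d, y1):
    """IV estimate solving E_n[A_i d_i'] alpha = E_n[A_i y_1i], cf. (1.8)."""
    n = len(y1)
    return np.linalg.solve(A.T @ d / n, A.T @ y1 / n)

# Hypothetical toy design: one instrument z, intercept as the only control.
rng = np.random.default_rng(0)
n = 5000
z = rng.standard_normal(n)
w = np.ones((n, 1))
D = 1.5 * z                                   # optimal instrument, known here
errs = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=n)
eps, v = errs[:, 0], errs[:, 1]
y2 = D + v                                    # first stage, as in (1.2)
alpha0 = np.array([1.0, 0.5])                 # assumed true parameter
d = np.column_stack([y2, w])                  # d_i = (y_2i, w_i')'
y1 = d @ alpha0 + eps                         # assumed structural equation
A = np.column_stack([D, w])                   # A_i = (D(x_i), w_i')'

alpha_iv = iv_estimator(A, d, y1)
alpha_ols = np.linalg.lstsq(d, y1, rcond=None)[0]
print("IV :", alpha_iv)
print("OLS:", alpha_ols)
```

Because the disturbances are correlated, the OLS coefficient on the endogenous variable is biased away from the assumed true value, while the IV estimate is not.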

Sufficient conditions for showing that the IV estimator obtained using sparse methods to
estimate the optimal instruments is asymptotically efficient include a set of technical
conditions and the following key growth condition:

$$s^2 \log^2 p = o(n).$$

This rate condition requires the optimal instruments to be sufficiently smooth so that a small
number of series terms can be used to approximate them well. This smoothness ensures that
the impact of instrument estimation on the IV estimator is asymptotically negligible.
The rate condition above is substantive and cannot be substantially weakened for the
full-sample IV estimator considered above. However, we can replace this condition with the
weaker condition that

$$s \log p = o(n)$$

by employing a sample splitting method. Specifically, we consider dividing the sample into
(approximately) equal random parts $a$ and $b$, with sizes $n_a = \lceil n/2 \rceil$ and
$n_b = n - n_a$. We use superscripts $a$ and $b$ for variables in the first and second subsample
respectively. The index $i$ will enumerate observations in both samples, with ranges for the
index given by $1 \le i \le n_a$ for sample $a$ and $1 \le i \le n_b$ for sample $b$. Let
$\hat\sigma_j^k = (\mathbb{E}_{n_k}[(f_{ij}^k)^2])^{1/2}$, $k = a, b$, $j = 1, \ldots, p$, and
$H_k = \mathrm{diag}(\hat\sigma_1^k, \ldots, \hat\sigma_p^k)$. Then we shall normalize the technical
regressors in the subsamples, $\tilde f_{ij}^k = f_{ij}^k / \hat\sigma_j^k$, so that
$\mathbb{E}_{n_a}[(\tilde f_{ij}^a)^2] = 1$ and $\mathbb{E}_{n_b}[(\tilde f_{ij}^b)^2] = 1$ for
$j = 1, \ldots, p$. We can use each of the subsamples to fit the first stage via LASSO and
variants, obtaining the first stage estimates $\hat\beta^k$, $k = a, b$. Then setting
$\hat D_i^a = \tilde f_i^{a\prime} H_a H_b^{-1} \hat\beta^b$, $1 \le i \le n_a$,
$\hat D_i^b = \tilde f_i^{b\prime} H_b H_a^{-1} \hat\beta^a$, $1 \le i \le n_b$, and
$\hat A_i^k = (\hat D_i^k, w_i^{k\prime})'$, $k = a, b$, we form the IV estimates in the two
subsamples:

$$\hat\alpha_a = \mathbb{E}_{n_a}[\hat A_i^a d_i^{a\prime}]^{-1}\,\mathbb{E}_{n_a}[\hat A_i^a y_{1i}^a], \qquad \hat\alpha_b = \mathbb{E}_{n_b}[\hat A_i^b d_i^{b\prime}]^{-1}\,\mathbb{E}_{n_b}[\hat A_i^b y_{1i}^b]. \tag{1.10}$$

Then we combine the estimates into one,

$$\hat\alpha_{ab} = \left(n_a \mathbb{E}_{n_a}[\hat A_i^a \hat A_i^{a\prime}] + n_b \mathbb{E}_{n_b}[\hat A_i^b \hat A_i^{b\prime}]\right)^{-1}\left(n_a \mathbb{E}_{n_a}[\hat A_i^a \hat A_i^{a\prime}]\,\hat\alpha_a + n_b \mathbb{E}_{n_b}[\hat A_i^b \hat A_i^{b\prime}]\,\hat\alpha_b\right), \tag{1.11}$$

where under i.i.d. sampling and random design we can also take

$$\hat\alpha_{ab} = \tfrac{1}{2}\hat\alpha_a + \tfrac{1}{2}\hat\alpha_b. \tag{1.12}$$

The second main result is to show that

$$(\sigma_\epsilon^2 Q_n^{-1})^{-1/2}\sqrt{n}\,(\hat\alpha_{ab} - \alpha_0) \to_d N(0, I) \tag{1.13}$$

under suitable regularity conditions. That is, the IV estimator based on estimating the
first stage with appropriate sparse methods and sample splitting is asymptotically as efficient
as the infeasible optimal IV estimator that uses $D(x_i)$, and thus achieves the
semi-parametric efficiency bound.
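
The combination step in (1.11) can be sketched as follows. The helper name `combine_split` and the random inputs are hypothetical; the only fact used is that, for a matrix whose rows are $\hat A_i^k$, the product of its transpose with itself equals $n_k \mathbb{E}_{n_k}[\hat A_i^k \hat A_i^{k\prime}]$.

```python
import numpy as np

def combine_split(A_a, alpha_a, A_b, alpha_b):
    """Weighted combination of two subsample IV estimates, cf. (1.11).

    A_k has rows A_i^k, so A_k.T @ A_k = n_k * E_{n_k}[A_i^k A_i^k'].
    """
    G_a, G_b = A_a.T @ A_a, A_b.T @ A_b
    return np.linalg.solve(G_a + G_b, G_a @ alpha_a + G_b @ alpha_b)

# Hypothetical inputs, for illustration only.
rng = np.random.default_rng(1)
A_a = rng.standard_normal((60, 2))
A_b = rng.standard_normal((40, 2))
alpha_a = np.array([1.0, 0.4])
alpha_b = np.array([1.2, 0.6])
alpha_ab = combine_split(A_a, alpha_a, A_b, alpha_b)
print(alpha_ab)
```

When the two weighting matrices coincide, the formula reduces to the simple average in (1.12).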
2. Properties of the IV Estimator with a Generic Sparsity-Based Estimator of Optimal Instruments



In this section, we establish our main result. Under a high-level condition on the rates of
convergence of a generic sparsity-based estimator of the optimal instrument, we show that
the IV estimator based on these estimated instruments is asymptotically as efficient as the
infeasible optimal estimator. Later we verify that many estimators that arise from sparse
methods satisfy this rate condition. In particular, we show it is satisfied for LASSO,
Post-LASSO, √LASSO, and Post-√LASSO.

Theorem 1 (Generic Result on Optimal IV Estimation). In the linear IV model of Section
1, assume that $\sigma_v$, $\sigma_\epsilon$ and the eigenvalues of $Q_n = \mathbb{E}_n[A_i A_i']$ are
bounded away from zero and from above uniformly in $n$. Let $\hat D_i = f_i'\hat\beta$ be a generic
sparsity-based estimator of optimal instruments $D_i = D(x_i)$ that obeys as $n$ grows

$$\|f_i'\hat\beta - f_i'\beta_0\|_{2,n} + c_s + \|\mathbb{G}_n(f_i \epsilon_i)\|_\infty \|\hat\beta - \beta_0\|_1 = o_P(1). \tag{2.14}$$

Then the IV estimator based on the equation (1.8) is $\sqrt{n}$-consistent and is asymptotically
oracle-efficient, namely as $n$ grows:

$$(\sigma_\epsilon^2 Q_n^{-1})^{-1/2}\sqrt{n}\,(\hat\alpha^* - \alpha_0) =_d N(0, I) + o_P(1),$$

and the result continues to hold with $Q_n$ replaced by $\hat Q_n = \mathbb{E}_n[\hat A_i \hat A_i']$,
and $\sigma_\epsilon^2$ by $\hat\sigma_\epsilon^2 = \mathbb{E}_n[(y_{1i} - \ldots)^2]$.



Theorem 1 establishes the sufficiency of (2.14) to derive the asymptotic oracle-efficiency of
the proposed IV estimator. Under normality of the disturbances $\epsilon_i$ and standardized
$f_{ij}$'s, we have

$$\|\mathbb{G}_n(f_i \epsilon_i)\|_\infty \lesssim_P \sigma_\epsilon\sqrt{\log p}.$$

Thus, we shall have that (2.14) holds provided that $s^2 \log^2 p = o(n)$ by combining the
relation above, standard approximation conditions (1.5), and typical rates of convergence for
sparse estimators (as shown in Section 4). The remaining conditions are quite standard and
simply ensure that the optimal instruments would be well-behaved instrumental variables if
they were known.
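
To see where the growth condition comes from, the bounds can be chained explicitly. The following is a sketch under two assumptions taken from later sections: the $\ell_1$-rate $\|\hat\beta - \beta_0\|_1 \lesssim_P \sigma_v\sqrt{s^2\log(p/\gamma)/n}$ for the sparse estimator, and the penalty choice $\gamma = 1/p$.

```latex
\begin{align*}
\|\mathbb{G}_n(f_i\epsilon_i)\|_\infty \,\|\hat\beta-\beta_0\|_1
 &\lesssim_P \sigma_\epsilon\sqrt{\log p}\;\cdot\;\sigma_v\sqrt{\frac{s^2\log(p/\gamma)}{n}} \\
 &= \sigma_\epsilon\,\sigma_v\sqrt{\frac{2\,s^2\log^2 p}{n}}
    \qquad \text{since } \gamma = 1/p \text{ gives } \log(p/\gamma) = 2\log p \\
 &= o(1) \qquad \text{whenever } s^2\log^2 p = o(n).
\end{align*}
```

Under the same assumptions the other two terms in (2.14) are of smaller order: $c_s \lesssim \sigma_v\sqrt{s/n}$ by (1.5), and the prediction-norm term is of order $\sigma_v\sqrt{s\log(p/\gamma)/n}$.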

While the conditions of Theorem 1 are quite general, we can weaken the sufficient rate
condition by employing the split-sample IV estimator described in (1.10) and (1.11).



LASSO METHODS FOR GAUSSIAN INSTRUMENTAL VARIABLES MODELS

Theorem 2 (Generic Result on Optimal IV Estimation via Sample-Splitting). In the linear IV
model of Section 1, assume that $\sigma_v$, $\sigma_\epsilon$ and the eigenvalues of
$Q_n = \mathbb{E}_n[A_i A_i']$ are bounded away from zero and from above uniformly in $n$. Suppose
that for the generic split-sample estimates described in (1.10)

$$\|\hat D_i^k - D_i^k\|_{2,n_k} = o_P(1), \quad k = a, b. \tag{2.15}$$

Then the split-sample IV estimator based on the equation (1.11) is $\sqrt{n}$-consistent and is
asymptotically oracle-efficient, namely as $n$ grows:

$$(\sigma_\epsilon^2 Q_n^{-1})^{-1/2}\sqrt{n}\,(\hat\alpha_{ab} - \alpha_0) =_d N(0, I) + o_P(1),$$

and the result continues to hold with $Q_n$ replaced by $\hat Q_n = \mathbb{E}_n[\hat A_i \hat A_i']$
and $\sigma_\epsilon^2$ by $\hat\sigma_\epsilon^2 = \mathbb{E}_n[(y_{1i} - \ldots)^2]$.



The conditions used in Theorem 2 are quite similar to those of Theorem 1. The key difference
is that condition (2.15) may hold under the weaker rate condition that $s \log p = o(n)$.
Intuitively, weakening of the rate condition is due to the fact
that using the first-stage coefficients estimated in one subsample to form estimates of optimal
instruments in the other reduces the overfitting bias that drives the bias and inconsistency of
two-stage least squares with many instruments. Removing this bias allows one to efficiently
estimate the second stage relationship while using more instruments than in a case where the
overfitting bias is not controlled for. It is important to note that the practical gains may be
offset by the fact that the split-sample IV estimator avoids overfitting bias by fitting the
first stage on a much smaller set of observations than the full-sample procedure, which
generically produces a weaker first-stage relationship. Thus, one is potentially trading off
overfitting bias for weaker instruments and potential weak identification problems. This
tradeoff suggests the split-sample approach may perform relatively worse than the full-sample
estimator in situations where instruments are not very strong in finite samples.
3. Examples of Sparse Estimators of Optimal IV: LASSO and some of its Modifications



Given the sample $\{(y_{2i}, x_i), i = 1, \ldots, n\}$ obeying the regression equation (1.2), we
consider estimators of the optimal instrument $D_i = D(x_i)$ that take the form

$$\hat D_i = \hat D(x_i) = f_i'\hat\beta, \tag{3.16}$$

where $\hat\beta$ is obtained by using a sparse method estimator with $y_{2i}$ as the dependent
variable and $f_i$ as regressors.

Recall that we consider the case of fixed design. Thus, we treat the covariate values
$f_1, \ldots, f_n$ as fixed since the analysis is conditional on $x_1, \ldots, x_n$. Also, the
dimension $p = p_n$ of each $f_i$ is allowed to grow as the sample size increases, where
potentially $p > n$. In making asymptotic statements we also assume that $p \to \infty$ as
$n \to \infty$.

Without loss of generality we normalize the regressors so that $\mathbb{E}_n[f_{ij}^2] = 1$ for
$j = 1, \ldots, p$.

The classical AIC/BIC type estimators are sparse as they solve the following optimization
problem:

$$\min_{\beta \in \mathbb{R}^p} \hat Q(\beta) + \frac{\lambda}{n}\|\beta\|_0,$$

where $\hat Q(\beta) = \mathbb{E}_n[(y_{2i} - f_i'\beta)^2]$ and $\lambda$ is the penalty level.
Unfortunately this problem is computationally prohibitive when $p$ is large, since the solution
to the problem may require solving $\sum_{k \le n}\binom{p}{k}$ least squares problems (thus, the
complexity of this problem is NP-hard).

However, replacing the $\|\cdot\|_0$-regularization by a $\|\cdot\|_1$-regularization still yields
sparse solutions and preserves the convexity of the criterion function. The latter substantially
reduces the computational burden and makes these methods (provably) applicable to very
high-dimensional problems. This method is called LASSO. In the following, we discuss LASSO and
several variants in more detail.

1. LASSO. The LASSO estimator solves the following convex optimization problem:

$$\hat\beta_L \in \arg\min_{\beta \in \mathbb{R}^p} \hat Q(\beta) + \frac{\lambda}{n}\|\beta\|_1, \quad \text{with } \lambda = c \cdot 2\sigma_v \Lambda(1 - \gamma \mid F), \tag{3.17}$$

where $c > 1$ (we recommend $c = 1.1$) and $\Lambda(1 - \gamma \mid F)$ is the $(1 - \gamma)$-quantile of

$$n\|\mathbb{E}_n[f_i g_i]\|_\infty$$

conditional on $F = [f_1, \ldots, f_n]'$, with $g_i \sim N(0, 1)$ independent for $i = 1, \ldots, n$.
We note that

$$\Lambda(1 - \gamma \mid F) \le \sqrt{n}\,\Phi^{-1}(1 - \gamma/2p) \le \sqrt{2n\log(p/\gamma)}.$$

We set $\gamma = 1/p$ which leads to $\gamma = o(1)$ since $p \to \infty$ as $n \to \infty$.
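
A minimal sketch of how the LASSO problem above could be solved is given below, using plain cyclic coordinate descent. This is one standard solver for $\ell_1$-penalized least squares, not necessarily the algorithm the authors used. The function name `lasso_cd`, the simulated design, and the assumption that $\sigma_v = 1$ is known are all illustrative; the penalty uses the analytic upper bound $\sqrt{2n\log(p/\gamma)}$ in place of the simulated quantile.

```python
import numpy as np

def lasso_cd(F, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for min_b E_n[(y - F b)^2] + (lam/n) * ||b||_1.

    Assumes the columns of F are normalized so that E_n[f_ij^2] = 1, in which
    case each coordinate update is a soft-thresholding step at lam / (2n).
    """
    n, p = F.shape
    beta = np.zeros(p)
    resid = y.astype(float).copy()
    thr = lam / (2.0 * n)
    for _ in range(n_sweeps):
        for j in range(p):
            # E_n[f_j * (partial residual excluding coordinate j)]
            rho = F[:, j] @ resid / n + beta[j]
            new = np.sign(rho) * max(abs(rho) - thr, 0.0)
            if new != beta[j]:
                resid -= F[:, j] * (new - beta[j])
                beta[j] = new
    return beta

# Hypothetical exactly sparse first stage with three relevant instruments.
rng = np.random.default_rng(0)
n, p, sigma_v = 200, 100, 1.0
F = rng.standard_normal((n, p))
F /= np.sqrt((F ** 2).mean(axis=0))           # enforce E_n[f_ij^2] = 1
beta0 = np.zeros(p)
beta0[:3] = 2.0
y2 = F @ beta0 + sigma_v * rng.standard_normal(n)

gamma, c = 1.0 / p, 1.1
lam = c * 2 * sigma_v * np.sqrt(2 * n * np.log(p / gamma))  # upper bound on (3.17)
beta_L = lasso_cd(F, y2, lam)
print("selected instruments:", np.flatnonzero(beta_L))
```

The estimated coefficients on the relevant instruments are shrunk towards zero, which is the bias that the post-LASSO refit is meant to remove.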

2. Post-LASSO. The Post-LASSO estimator is simply ordinary least squares (OLS) applied to
the data after removing the instruments/regressors that were not selected by LASSO. Set

$$\hat T = \mathrm{support}(\hat\beta_L) = \{j \in \{1, \ldots, p\} : |\hat\beta_{Lj}| > 0\}.$$

Then the post-LASSO estimator $\hat\beta_{PL}$ is

$$\hat\beta_{PL} \in \arg\min_{\beta \in \mathbb{R}^p} \hat Q(\beta) : \beta_j = 0 \text{ if } j \notin \hat T. \tag{3.18}$$
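
The refit in (3.18) amounts to least squares restricted to the selected columns. A minimal sketch follows; the function name `post_lasso` and the shrunken input vector are hypothetical stand-ins for the output of a first-step LASSO fit.

```python
import numpy as np

def post_lasso(F, y, beta_lasso):
    """OLS on the support selected by a first-step estimate, cf. (3.18)."""
    support = np.flatnonzero(beta_lasso != 0)
    beta_pl = np.zeros(F.shape[1])
    if support.size:
        beta_pl[support] = np.linalg.lstsq(F[:, support], y, rcond=None)[0]
    return beta_pl, support

# Hypothetical data: two relevant regressors out of ten.
rng = np.random.default_rng(2)
n, p = 100, 10
F = rng.standard_normal((n, p))
y2 = 2.0 * F[:, 0] - 1.0 * F[:, 3] + 0.1 * rng.standard_normal(n)

beta_lasso = np.zeros(p)
beta_lasso[[0, 3]] = [1.6, -0.7]              # hypothetical shrunken LASSO output
beta_pl, T_hat = post_lasso(F, y2, beta_lasso)
print("support:", T_hat, "refit:", beta_pl[T_hat])
```

When the first step selects the correct model, the refitted coefficients no longer carry the shrinkage present in the input vector.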



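3. Square-root LASSO. The √LASSO estimator is defined as the solution to the following
optimization problem:

$$\hat\beta_{SQ} \in \arg\min_{\beta \in \mathbb{R}^p} \sqrt{\hat Q(\beta)} + \frac{\lambda}{n}\|\beta\|_1 \tag{3.19}$$

$$\text{with } \lambda = c \cdot \Lambda(1 - \gamma \mid F), \tag{3.20}$$

where $c > 1$ (we recommend $c = 1.1$) and $\Lambda(1 - \gamma \mid F)$ denotes the
$(1 - \gamma)$-quantile of

$$n\|\mathbb{E}_n[f_i g_i]\|_\infty \Big/ \sqrt{\mathbb{E}_n[g_i^2]}$$

conditional on $f_1, \ldots, f_n$, with $g_i \sim N(0, 1)$ independent for $i = 1, \ldots, n$. We set
$\gamma = 1/p$ which leads to $\gamma = o(1)$ as $n \to \infty$ since $p \to \infty$ as $n$ grows.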



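4. Post-square-root LASSO. The post-√LASSO estimator is OLS applied to the data
after removing the instruments/regressors that were not selected by √LASSO. Set

$$\hat T_{SQ} = \mathrm{support}(\hat\beta_{SQ}) = \{j \in \{1, \ldots, p\} : |\hat\beta_{SQj}| > 0\},$$

and define the post-√LASSO estimator $\hat\beta_{PSQ}$ as

$$\hat\beta_{PSQ} \in \arg\min_{\beta \in \mathbb{R}^p} \hat Q(\beta) : \beta_j = 0 \text{ if } j \notin \hat T_{SQ}. \tag{3.21}$$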



The LASSO and √LASSO estimators rely on $\ell_1$-norm regularization. By penalizing the
$\ell_1$-norm of the coefficients, each estimator shrinks its estimated coefficients towards zero
relative to the OLS estimator. Moreover, the kink at zero of the $\ell_1$-norm induces the
estimators to have many zero components (in contrast with $\ell_2$-norm regularization).

Penalizing by $\|\beta\|_1$ yields sparse solutions but also introduces a bias towards zero on the
components selected to be non-zero. In order to alleviate this effect, the post-LASSO and
post-√LASSO estimators are defined as ordinary least squares regression applied to the model
selected by the LASSO or √LASSO. It is clear that the post-LASSO and post-√LASSO
estimators remove the shrinkage bias of the associated estimator when it perfectly selects the
model. Under mild regularity conditions, we also have that post-LASSO performs at least as
well as LASSO when $\hat m$ additional variables are selected or LASSO misses some elements of
$\beta_0$. We refer to [9] for a detailed analysis of post-LASSO.

The key quantity in the analysis of LASSO's properties is the score, the gradient of $\hat Q$
evaluated at the true parameter ignoring approximation error:

$$S = 2\mathbb{E}_n[f_i v_i].$$

The penalty level $\lambda$ should be chosen so it dominates the noise. Namely, for some $c > 1$
one should set

$$\lambda \ge c\,n\|S\|_\infty. \tag{3.22}$$

Unfortunately, this is not feasible since $S$ is not observed. However, we can choose $\lambda$
based on the quantiles of $\|S\|_\infty$ conditional on $f_1, \ldots, f_n$. Note that the components
of $S$ are normal (potentially correlated), so its distribution can be easily simulated. Using the
choice of penalty level (3.17) for LASSO, it follows that (3.22) occurs with probability
$1 - \gamma$.
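
The simulation mentioned above can be sketched as follows. The function name `simulate_Lambda`, the number of draws, the simulated design, and the assumption that $\sigma_v = 1$ is known are illustrative choices; the two analytic bounds are the ones stated alongside (3.17).

```python
import numpy as np
from statistics import NormalDist

def simulate_Lambda(F, gamma, n_sim=5000, seed=0):
    """Simulated (1 - gamma)-quantile of n * ||E_n[f_i g_i]||_inf given F.

    With g ~ N(0, I_n), n * E_n[f_i g_i] is simply F' g.
    """
    rng = np.random.default_rng(seed)
    n = F.shape[0]
    G = rng.standard_normal((n, n_sim))        # one column per simulated draw
    sup_stats = np.abs(F.T @ G).max(axis=0)
    return np.quantile(sup_stats, 1.0 - gamma)

# Hypothetical design, normalized so that E_n[f_ij^2] = 1.
rng = np.random.default_rng(3)
n, p = 100, 50
F = rng.standard_normal((n, p))
F /= np.sqrt((F ** 2).mean(axis=0))

gamma, c, sigma_v = 1.0 / p, 1.1, 1.0          # sigma_v assumed known here
Lam = simulate_Lambda(F, gamma)
lam = c * 2 * sigma_v * Lam                    # penalty level as in (3.17)

bound_quantile = np.sqrt(n) * NormalDist().inv_cdf(1 - gamma / (2 * p))
bound_log = np.sqrt(2 * n * np.log(p / gamma))
print(Lam, bound_quantile, bound_log)
```

The simulated quantile adapts to the correlation among the columns of the design, so it can be noticeably smaller than the analytic bounds when instruments are highly correlated.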



The proposed penalty choice for the LASSO estimator depends on the standard deviation
$\sigma_v$ of the disturbances. Typically, $\sigma_v$ is unknown and must be estimated. Relying on
upper bounds on $\sigma_v$ can lead to an overly large penalty and thus may result in missing
relevant components of $\beta_0$. The estimation of $\sigma_v$ could be done as proposed in [10]
under mild conditions. The square-root LASSO aims to circumvent this limitation.

As in the case of LASSO, the key quantity determining the choice of the penalty level for
√LASSO is its score, in this case the gradient of $\sqrt{\hat Q}$ evaluated at the true parameter
value $\beta = \beta_0$ ignoring approximation error:

$$\tilde S := \frac{\mathbb{E}_n[f_i v_i]}{\sqrt{\mathbb{E}_n[v_i^2]}}.$$

Because of the normalization by $\sqrt{\mathbb{E}_n[v_i^2]}$, the distribution of the score
$\tilde S$ does not depend on the unknown standard deviation $\sigma_v$ or the unknown true
parameter value $\beta_0$. Therefore, the score is pivotal with respect to these parameters,
conditional on $f_1, \ldots, f_n$. Thus, setting the penalty level as (3.20), with probability
$1 - \gamma$, we have

$$\lambda \ge c\,n\|\tilde S\|_\infty.$$

We stress that the penalty level in (3.20) is independent of $\sigma_v$, in contrast to (3.17). The
properties of the √LASSO have been studied in [13], where bounds similar to LASSO on the
prediction norm and sparsity were established.
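
The pivotal property can be checked numerically. In the sketch below, the names `sqrt_lasso_score_stat` and `simulate_Lambda_sqrt` and the simulated design are illustrative assumptions; the statistic is $n\|\mathbb{E}_n[f_i v_i]\|_\infty / \sqrt{\mathbb{E}_n[v_i^2]}$, which is unchanged when the disturbances are rescaled.

```python
import numpy as np

def sqrt_lasso_score_stat(F, v):
    """n * ||E_n[f_i v_i]||_inf / sqrt(E_n[v_i^2]) for a disturbance vector v."""
    return np.abs(F.T @ v).max() / np.sqrt((v ** 2).mean())

def simulate_Lambda_sqrt(F, gamma, n_sim=5000, seed=0):
    """Simulated (1 - gamma)-quantile of the self-normalized statistic given F."""
    rng = np.random.default_rng(seed)
    n = F.shape[0]
    G = rng.standard_normal((n, n_sim))
    stats = np.abs(F.T @ G).max(axis=0) / np.sqrt((G ** 2).mean(axis=0))
    return np.quantile(stats, 1.0 - gamma)

# Hypothetical design, normalized so that E_n[f_ij^2] = 1.
rng = np.random.default_rng(5)
n, p = 100, 50
F = rng.standard_normal((n, p))
F /= np.sqrt((F ** 2).mean(axis=0))

v = rng.standard_normal(n)
stat_unit = sqrt_lasso_score_stat(F, v)          # disturbances with sd 1
stat_scaled = sqrt_lasso_score_stat(F, 7.0 * v)  # same draws with sd 7

lam_sqrt = 1.1 * simulate_Lambda_sqrt(F, gamma=1.0 / p)  # penalty as in (3.20)
print(stat_unit, stat_scaled, lam_sqrt)
```

Since the two statistics coincide, the simulated penalty needs no estimate of the disturbance scale.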

4. Properties of IV Estimator with LASSO-based Estimators of Optimal IV

In this section we establish various rates of convergence of the sparse methods described in
the previous section. In making asymptotic statements we also assume that $p \to \infty$ as
$n \to \infty$.

4.1. Regularity Conditions for Estimating Conditional Expectations. The key technical
condition used to establish the properties of the aforementioned sparsity methods for
estimating conditional expectations concerns the behavior of the empirical Gram matrix
$M = \mathbb{E}_n[f_i f_i']$. This matrix is necessarily singular when $p > n$, so in principle it is
not well-behaved. However, we only need good behavior of certain moduli of continuity of the Gram
matrix. The first modulus of continuity is called the restricted eigenvalue and is needed for
LASSO and √LASSO. The second modulus is called the sparse eigenvalue and is needed for
Post-LASSO and Post-√LASSO.

In order to define the restricted eigenvalue, first define the restricted set:

$$\Delta_C = \{\delta \in \mathbb{R}^p : \|\delta_{T^c}\|_1 \le C\|\delta_T\|_1,\ \delta \ne 0\},$$

where $T = \mathrm{support}(\beta_0)$. Then the restricted eigenvalues of a Gram matrix $M$ take
the form:

$$\kappa_C^2 := \min_{\delta \in \Delta_C} s\,\frac{\delta' M \delta}{\|\delta_T\|_1^2} \quad\text{and}\quad \tilde\kappa_C^2 := \min_{\delta \in \Delta_C} \frac{\delta' M \delta}{\|\delta\|_2^2}. \tag{4.23}$$

These restricted eigenvalues can depend on $n$, but we suppress the dependence in our notation.

In making simplified asymptotic statements, we will invoke the following condition:

Condition RE. For any $C > 0$, there exists a finite constant $\kappa > 0$, which can depend on
$C$, such that the restricted eigenvalues obey $\kappa_C \ge \kappa$ and $\tilde\kappa_C \ge \kappa$ as
$n \to \infty$.
The restricted eigenvalue (4.23) is a variant of the restricted eigenvalues introduced in Bickel,
Ritov and Tsybakov [14] to analyze the properties of LASSO in the classical Gaussian regression
model. Even though the minimal eigenvalue of the empirical Gram matrix $M$ is zero whenever
$p > n$, [14] show that its restricted eigenvalues can in fact be bounded away from zero. Many
more sufficient conditions are available from the literature; see [14]. Consequently, we take the
restricted eigenvalues as primitive quantities and Condition RE as a primitive condition. Note
also that the restricted eigenvalues are tightly tailored to the $\ell_1$-penalized estimation
problem.

In order to define the sparse eigenvalues, let us define the $m$-sparse subset of a unit sphere
as

$$\Delta(m) = \{\delta \in \mathbb{R}^p : \|\delta\|_0 \le m,\ \|\delta\|_2 = 1\},$$

and also define the minimal and maximal $m$-sparse eigenvalue of the Gram matrix $M$ as

$$\phi_{\min}(m) = \min_{\delta \in \Delta(m)} \delta' M \delta \quad\text{and}\quad \phi_{\max}(m) = \max_{\delta \in \Delta(m)} \delta' M \delta. \tag{4.24}$$

To simplify asymptotic statements, we use the following condition:

Condition SE. For any $C > 0$, there exist constants $0 < \kappa' < \kappa'' < \infty$ that do not
depend on $n$ but can depend on $C$, such that
$\kappa' \le \phi_{\min}(Cs) \le \phi_{\max}(Cs) \le \kappa''$ as $n \to \infty$.
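
For intuition, the sparse eigenvalues in (4.24) can be computed by brute force when $p$ and $m$ are small. The sketch below (the function name `sparse_eigenvalues` and the simulated design are illustrative assumptions) enumerates all $m \times m$ principal submatrices. Supports of exactly $m$ elements suffice because, by eigenvalue interlacing, smaller principal submatrices cannot produce more extreme values.

```python
import numpy as np
from itertools import combinations

def sparse_eigenvalues(M, m):
    """Brute-force minimal and maximal m-sparse eigenvalues of M, cf. (4.24).

    Feasible only for small p and m: the loop visits every support of size m.
    """
    p = M.shape[0]
    phi_min, phi_max = np.inf, -np.inf
    for idx in combinations(range(p), min(m, p)):
        eig = np.linalg.eigvalsh(M[np.ix_(idx, idx)])  # ascending eigenvalues
        phi_min, phi_max = min(phi_min, eig[0]), max(phi_max, eig[-1])
    return phi_min, phi_max

# Hypothetical design with p > n, so the full Gram matrix is singular.
rng = np.random.default_rng(4)
n, p = 10, 20
F = rng.standard_normal((n, p))
F /= np.sqrt((F ** 2).mean(axis=0))
M = F.T @ F / n                                # empirical Gram matrix E_n[f_i f_i']

full_eigs = np.linalg.eigvalsh(M)
phi_min2, phi_max2 = sparse_eigenvalues(M, 2)
print("min eigenvalue of M:", full_eigs[0])
print("2-sparse eigenvalues:", phi_min2, phi_max2)
```

The full Gram matrix has a zero eigenvalue, yet its 2-sparse eigenvalues stay away from zero, which is the kind of behavior Condition SE asks for.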

Condition SE requires that "small" m x m submatrices of the large p x p empirical Gram 
matrix are well-behaved. Moreover, Condition SE implies Condition RE by the argument given 
in 



It is well known that Conditions RE and SE are quite plausible for both many instrument
and many series instrument settings. For instance, Conditions RE and SE hold for
$M = \mathbb{E}_n[\tilde f_i \tilde f_i']$ with probability approaching one as $n \to \infty$ if
$\tilde f_i$ is a normalized form of $f_i$, namely
$\tilde f_{ij} = f_{ij} / \sqrt{\mathbb{E}_n[f_{ij}^2]}$, and either of the following holds:

• $f_i$, $i = 1, \ldots, n$, are i.i.d. zero-mean Gaussian random vectors whose population Gram
matrix $\mathrm{E}[f_i f_i']$ has ones on the diagonal and $s \log n$-sparse eigenvalues bounded
from above and away from zero, and $s \log n = o(n / \log p)$;

• $f_i$, $i = 1, \ldots, n$, are i.i.d. bounded zero-mean random vectors with
$\|f_i\|_\infty \le K_n$ a.s. whose population Gram matrix $\mathrm{E}[f_i f_i']$ has ones on the
diagonal and $s \log n$-sparse eigenvalues bounded from above and away from zero,
$\sqrt{n}/K_n \to \infty$, and $s \log n = o((1/K_n)\sqrt{n / \log p})$.
Recall that a standard assumption in econometric research is to assume that the population
Gram matrix $\mathrm{E}[f_i f_i']$ has eigenvalues bounded from above and below, see e.g. [25]. The
conditions above allow for this and much more general behavior, requiring only that the sparse
eigenvalues of the population Gram matrix $\mathrm{E}[f_i f_i']$ are bounded from below and from
above. The latter is important for allowing functions $f_i$ to be formed as a combination of
elements from different bases, e.g. a combination of B-splines with polynomials. The lemmas above
further show that under some restrictions on the growth of $s$ in relation to the sample size
$n$, the good behavior of the population sparse eigenvalues translates into a good behavior of
empirical sparse eigenvalues, which ensures that Conditions RE and SE are satisfied in large
samples.

4.2. Results on Sparse Estimators under Gaussian Errors. Next we gather rate of
convergence results for the different sparse estimators discussed in Section 3. We begin with
the rates for LASSO and Post-LASSO.

Lemma 1 (Rates for LASSO and Post-LASSO). Suppose we have the sample of size $n$ from
the model $y_{2i} = D(x_i) + v_i$, $i = 1, \ldots, n$, where $x_i$, $i = 1, \ldots, n$, are fixed, and
$v_i$, $i = 1, \ldots, n$, are i.i.d. Gaussian with variance $\sigma_v^2$. Suppose that the
approximate sparsity condition (1.5) holds for the function $D(x_i)$ with respect to $f_i$, and
that Conditions RE and SE hold for $M = \mathbb{E}_n[f_i f_i']$. Suppose the penalty level for
LASSO is specified as in (3.17) with $\gamma = o(1)$ as $n$ grows. Then, as $n$ grows, for
$\hat\beta$ defined as either the LASSO or Post-LASSO estimator and the associated fit
$\hat D_i = f_i'\hat\beta$,

$$\|\hat D_i - D_i\|_{2,n} \lesssim_P \sigma_v\sqrt{\frac{s\log(p/\gamma)}{n}}, \quad \|\hat\beta - \beta_0\|_2 \lesssim_P \sigma_v\sqrt{\frac{s\log(p/\gamma)}{n}}, \quad \|\hat\beta - \beta_0\|_1 \lesssim_P \sigma_v\sqrt{\frac{s^2\log(p/\gamma)}{n}}.$$



The following lemma derives the properties for √LASSO and Post-√LASSO.

Lemma 2 (Rates for √LASSO and Post-√LASSO). Suppose we have the sample of size $n$
from the model $y_{2i} = D(x_i) + v_i$, $i = 1, \ldots, n$, where $x_i$, $i = 1, \ldots, n$, are fixed,
and $v_i$, $i = 1, \ldots, n$, are i.i.d. Gaussian with variance $\sigma_v^2$. Suppose that the
approximate sparsity condition (1.5) holds for the function $D(x_i)$ with respect to $f_i$, and
that Conditions RE and SE hold for $M = \mathbb{E}_n[f_i f_i']$. Suppose the penalty level for
√LASSO is specified as in (3.20) with $\gamma = o(1)$ as $n$ grows. Then, as $n$ grows, provided
$s\log(p/\gamma) = o(n)$, $\hat\beta$ defined as either the √LASSO or Post-√LASSO estimator and
the associated fit $\hat D_i = f_i'\hat\beta$ satisfy

$$\|\hat D_i - D_i\|_{2,n} \lesssim_P \sigma_v\sqrt{\frac{s\log(p/\gamma)}{n}}, \quad \|\hat\beta - \beta_0\|_2 \lesssim_P \sigma_v\sqrt{\frac{s\log(p/\gamma)}{n}}, \quad \|\hat\beta - \beta_0\|_1 \lesssim_P \sigma_v\sqrt{\frac{s^2\log(p/\gamma)}{n}}.$$



Although all these estimators enjoy similar rates, their practical performance in finite samples
can be quite different. As mentioned before, Post-LASSO aims to reduce the regularization
bias introduced by LASSO. This is typically desirable if LASSO generated a sufficiently sparse
estimator so that Post-LASSO does not overfit. However, LASSO (and therefore Post-LASSO)
relies on the knowledge or pre-estimation of the standard deviation $\sigma_v$ of the disturbances
$v_i$. √LASSO circumvents this at the cost of a mild side condition having to hold (typically
$s \log p = o(n)$). Finally, Post-√LASSO aims to remove the shrinkage bias inherent in √LASSO.

Based on these lemmas we can achieve the results in Theorem 1 based on primitive assumptions
for any of these four estimators.

Theorem 3 (Asymptotic Normality for IV Based on LASSO, Post-LASSO, √LASSO and
Post-√LASSO). In the linear IV model of Section 1, assume that $\sigma_v$, $\sigma_\epsilon$ and the
eigenvalues of $Q_n = \mathbb{E}_n[A_i A_i']$ are bounded away from zero and from above uniformly
in $n$. Suppose that the optimal instrument is approximately sparse, namely (1.5) holds,
conditions RE, SE hold for $M = \mathbb{E}_n[f_i f_i']$, $\gamma = 1/p = o(1)$, and
$s^2 \log^2 p = o(n)$ hold, and let $\hat D_i = f_i'\hat\beta$ where $\hat\beta$ is the
LASSO, Post-LASSO, √LASSO or Post-√LASSO estimator. Then the IV estimator based on
the equation (1.8) is $\sqrt{n}$-consistent and is asymptotically oracle-efficient, namely as $n$
grows:

$$(\sigma_\epsilon^2 Q_n^{-1})^{-1/2}\sqrt{n}\,(\hat\alpha^* - \alpha_0) =_d N(0, I) + o_P(1),$$

and the result continues to hold with $Q_n$ replaced by $\hat Q_n = \mathbb{E}_n[\hat A_i \hat A_i']$
and $\sigma_\epsilon^2$ by $\hat\sigma_\epsilon^2 = \mathbb{E}_n[(y_{1i} - \hat A_i'\hat\alpha^*)^2]$.



In the analysis of split-sample IV recall that we re-normalized the technical regressors in
the subsamples so that $\mathbb{E}_{n_a}[(\tilde f_{ij}^a)^2] = 1$ and
$\mathbb{E}_{n_b}[(\tilde f_{ij}^b)^2] = 1$ for $j = 1, \ldots, p$, and the LASSO estimators are
applied to such samples. Letting $H_k = \mathrm{diag}(\hat\sigma_1^k, \ldots, \hat\sigma_p^k)$ we have
from condition (1.5) that

$$D(x_i^k) = f_i^{k\prime}\beta_0 + a(x_i^k) = \tilde f_i^{k\prime} H_k \beta_0 + a(x_i^k), \quad k = a, b, \tag{4.25}$$

so that an approximately sparse model for each subsample follows from the approximately sparse
model for the full sample, with $\beta_0$ premultiplied by the diagonal matrix $H_k$ that
enforces the subsample normalization.

Theorem 4 (Asymptotic Normality for Split-Sample IV Based on LASSO, Post-LASSO,
√LASSO and Post-√LASSO). In the linear IV model of Section 1, assume that $\sigma_v$,
$\sigma_\epsilon$ and the eigenvalues of $Q_n = \mathbb{E}_n[A_i A_i']$ are bounded away from zero and
from above uniformly in $n$. Suppose that the optimal instrument is approximately sparse, namely
(1.5) holds, conditions RE, SE hold for $M^k = \mathbb{E}_{n_k}[\tilde f_i^k \tilde f_i^{k\prime}]$
for $k = a, b$, $\gamma = 1/p = o(1)$, and $s \log p = o(n)$ hold, and let
$\hat D_i^k = \tilde f_i^{k\prime} H_k H_{k^c}^{-1} \hat\beta^{k^c}$ where $\hat\beta^{k^c}$ is the
LASSO, Post-LASSO, √LASSO or Post-√LASSO estimator applied to the subsample
$\{(y_{2i}^{k^c}, \tilde f_i^{k^c}) : 1 \le i \le n_{k^c}\}$ for $k = a, b$, and
$k^c = \{a, b\} \setminus k$. Then the split-sample IV estimator based on the equation (1.11) is
$\sqrt{n}$-consistent and is asymptotically oracle-efficient, namely as $n$ grows:

$$(\sigma_\epsilon^2 Q_n^{-1})^{-1/2}\sqrt{n}\,(\hat\alpha_{ab} - \alpha_0) =_d N(0, I) + o_P(1),$$

and the result continues to hold with $Q_n$ replaced by $\hat Q_n = \mathbb{E}_n[\hat A_i \hat A_i']$
and $\sigma_\epsilon^2$ by $\hat\sigma_\epsilon^2 = \mathbb{E}_n[(y_{1i} - \ldots)^2]$.



Theorems 3 and 4 verify that the conditions required in the generic results given in Theorems
1 and 2 are satisfied when LASSO, √LASSO, post-LASSO, or post-√LASSO are used to
estimate the optimal instruments. The conditions in the two theorems are quite similar.
As mentioned above, the condition on sparsity embodied by the rate condition
$s \log p = o(n)$ in Theorem 4 is weaker than the analogous condition in Theorem 3. Both results
also impose restrictions on the empirical design matrices, $M$ in Theorem 3 and $M^k$ for
$k = a, b$ in Theorem 4. These conditions are similar to, but weaker than, the usual full-rank
condition for estimating linear models via ordinary least squares. Both theorems also implicitly
assume that identification is strong, i.e. that $D(x_i)$ is bounded away from the zero-function.
BELLONI CHERNOZHUKOV HANSEN
5. Simulation Experiment 



The theoretical results presented in the previous sections suggest that using LASSO to
aid in fitting the first-stage regression should result in IV estimators with good estimation and
inference properties. In this section, we provide simulation evidence on estimation and inference
properties of IV estimators using LASSO and √LASSO to select instrumental variables for a
second-stage estimator. All results reported in this section are for post-LASSO and
post-√LASSO, but we refer to LASSO or √LASSO to simplify the presentation.

Our simulations are based on a simple instrumental variables model of the form

$$y_i = \alpha d_i + \epsilon_i,$$
$$d_i = z_i'\Pi + v_i,$$

where $\alpha = 1$ is the parameter of interest, and
$z_i = (z_{i1}, z_{i2}, \ldots, z_{i100})' \sim N(0, \Sigma_Z)$ is a 100 ×



For the other parameters, we use a variety of different parameter settings. We provide
simulation results for sample sizes, $n$, of 101 and 500. We consider two different values for
$\mathrm{Corr}(\epsilon, v)$: .3 and .6. We also consider three values of $\sigma_v^2$ which are chosen to benchmark
three different strengths of instruments. The three values of cr^ are found as a1 = "p*nai^ 
for three different values of F*: 10, 40, and 160. Finally, we use two different settings for 
the first stage coefficients, 11. The first sets the first five elements of 11 equal to one and the 
remaining elements equal to zero. We refer to this design as the "cut-off" design. The second 
model sets the coefficient on zih = .1^^^^ for /i = 1, 100. We refer to this design as the 
"exponential" design. In the cut-off case, the first-stage has an exact sparse representation, 
while in the exponential design, the model is not literally sparse although the majority of 
explanatory power is contained in the first few instruments. 
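The design can be sketched in code as follows. This is an illustrative sketch only: the function names are hypothetical, and several ingredients are assumptions rather than facts taken from the text, namely the Toeplitz form of $\Sigma_Z$ with parameter `rho_z`, the normalization Var(ε) = 1, the decay base `base` of the exponential design, and the calibration of $\sigma_v^2$ to F* used in `simulate`. What is taken from the text is α = 1, 100 instruments, the cut-off design with five unit coefficients, and the values of Corr(ε, v) and F*.

```python
import numpy as np


def first_stage_coefs(design, p=100, base=0.7):
    """First-stage coefficient vector Pi for the two designs."""
    h = np.arange(p)
    if design == "cutoff":
        return (h < 5).astype(float)   # first five equal to one, the rest zero
    if design == "exponential":
        return base ** h               # geometric decay; `base` is an assumption
    raise ValueError(f"unknown design: {design}")


def simulate(n, pi, corr_ev, f_star, alpha=1.0, rho_z=0.5, seed=None):
    """Draw one sample (y, d, Z) from y = alpha*d + e, d = Z @ pi + v."""
    rng = np.random.default_rng(seed)
    p = pi.size
    idx = np.arange(p)
    # Assumed instrument covariance: unit variances, correlation rho_z**|j - h|.
    sigma_z = rho_z ** np.abs(idx[:, None] - idx[None, :])
    # Assumed calibration of the first-stage error variance to the target F*.
    sigma_v2 = n * (pi @ sigma_z @ pi) / (f_star * (pi @ pi))
    sigma_v = np.sqrt(sigma_v2)
    # Assumed normalization Var(e) = 1, so Cov(e, v) = corr_ev * sigma_v.
    cov_ev = np.array([[1.0, corr_ev * sigma_v],
                       [corr_ev * sigma_v, sigma_v2]])
    Z = rng.multivariate_normal(np.zeros(p), sigma_z, size=n)
    e, v = rng.multivariate_normal(np.zeros(2), cov_ev, size=n).T
    d = Z @ pi + v
    y = alpha * d + e
    return y, d, Z
```

For instance, `simulate(101, first_stage_coefs("cutoff"), 0.3, 10.0, seed=1)` draws one replication of the weak-instrument cut-off design under these assumptions.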

For each setting of the simulation parameter values, we report results from several estimation
procedures. A simple possibility when presented with p < n instrumental variables is to
just estimate the model using 2SLS and all of the available instruments. It is well known
that this will result in poor finite-sample properties unless there are many more observations
than instruments; see, for example, [8]. Fuller's estimator [17] (FULL) is robust to many
instruments as long as the presence of many instruments is accounted for when constructing
standard errors for the estimators and p < n; see [8] and [20] for example. We report results for
these estimators in rows labeled 2SLS(100) and FULL(100) respectively. In addition, we report
Fuller and IV estimates based on the set of instruments selected by LASSO or √LASSO with
two different penalty selection methods. IV-LASSO, FULL-LASSO, IV-SQLASSO, and
FULL-SQLASSO are respectively 2SLS and Fuller using instruments selected by LASSO and 2SLS
and Fuller using instruments selected by √LASSO using the simple plug-in penalties given
in Section 3. IV-LASSO-CV, FULL-LASSO-CV, IV-SQLASSO-CV, and FULL-SQLASSO-CV
are respectively 2SLS and Fuller using instruments selected by LASSO and 2SLS and
Fuller using instruments selected by √LASSO using 10-fold cross-validation to choose the
penalty level. For each estimator, we report root-mean-squared-error (RMSE), median bias
(Med. Bias), mean absolute deviation (MAD), and rejection frequencies for 5% level tests
(rp(.05)). For computing rejection frequencies, we estimate conventional 2SLS standard errors
for all 2SLS estimators, and the many-instrument robust standard errors of [20] for the Fuller
estimators.
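The two second-stage estimators compared here can both be written as k-class estimators. The sketch below is a hedged illustration for the simple model with one endogenous regressor and no included controls; the function names are hypothetical and it omits the standard-error calculations (in particular the many-instrument robust standard errors of [20]). It uses the textbook definitions: 2SLS is the k-class estimator with k = 1, and the Fuller estimator sets k equal to the LIML eigenvalue minus c/(n - p), with the user-specified parameter c defaulting to one.

```python
import numpy as np


def _annihilate(Z, X):
    """Return X minus its least-squares projection on the columns of Z."""
    return X - Z @ np.linalg.lstsq(Z, X, rcond=None)[0]


def k_class(y, d, Z, k):
    """k-class estimate of alpha in y = alpha*d + e; k = 1 gives 2SLS."""
    Md = _annihilate(Z, d)
    return (d @ y - k * (Md @ y)) / (d @ d - k * (Md @ d))


def liml_kappa(y, d, Z):
    """Smallest eigenvalue of (Y'M_Z Y)^{-1} Y'Y with Y = [y, d]."""
    Y = np.column_stack([y, d])
    MY = _annihilate(Z, Y)
    eig = np.linalg.eigvals(np.linalg.solve(MY.T @ MY, Y.T @ Y))
    return float(np.min(eig.real))


def tsls(y, d, Z):
    """Two-stage least squares with instrument matrix Z."""
    return k_class(y, d, Z, 1.0)


def fuller(y, d, Z, c=1.0):
    """Fuller estimator with k = kappa_LIML - c / (n - p)."""
    n, p = Z.shape
    return k_class(y, d, Z, liml_kappa(y, d, Z) - c / (n - p))
```

Passing only the columns of Z chosen by a selection step gives the LASSO-based variants; passing all columns gives the analogues of 2SLS(100) and FULL(100), which require p < n.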

Simulation results are presented in Tables 1-4. Tables 1-2 give results for the cut-off design
with n = 101 and n = 500 respectively; and Tables 3-4 give results for the exponential design
with n = 101 and n = 500 respectively. As expected, 2SLS(100) does extremely poorly along all
dimensions. FULL(100) also performs worse than the LASSO- and √LASSO-based estimators
in terms of estimator risk (RMSE and MAD) in all cases. With n = 500, FULL(100) is on par
with the LASSO- and √LASSO-based estimators in terms of rejection frequencies (rp(.05))
but tends to perform much worse than these with n = 101.

All of the LASSO- and √LASSO-based procedures perform similarly in the two examples
with n = 500. Outside of outperforming the two procedures that use all of the instruments,
there is little that systematically differentiates the various estimators looking at RMSE and
MAD. There appears to be a tendency for the estimates with variable selection done with
the simple plug-in penalty to have slightly smaller RMSE and MAD than the estimates based
on using cross-validation to choose the penalty, though the pattern is not striking. Looking
at median bias, the Fuller estimator has uniformly smaller bias than the associated 2SLS
estimator that uses the same instruments, as predicted by the theory for the Fuller estimator.
That this does not equate to uniformly smaller MAD or RMSE is of course due to the fact that
the Fuller estimator is slightly more variable than 2SLS. Finally, all estimators do fairly well in
terms of 95% coverage probabilities, though once again the Fuller-based tests have uniformly
smaller size distortions than the associated 2SLS tests using the same instruments. Tests that
use instruments selected by cross-validation also do worse in terms of coverage probabilities
than the tests that use the simple plug-in rule. This difference is especially pronounced for
small values of F* and goes away as F* increases.

Footnote: The Fuller estimator requires a user-specified parameter. We set this parameter equal to one, which produces
a higher-order unbiased estimator. See [19] for additional discussion.
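The four summary statistics reported for each estimator can be computed from the simulation draws along the following lines. This is a sketch with hypothetical names; it assumes that MAD is the mean absolute deviation from the true value, that tests use a normal critical value of 1.96, and that replications in which no instruments are selected are coded as NaN. Following the table notes, such replications are dropped from RMSE, Med. Bias, and MAD and are counted as failures to reject.

```python
import numpy as np


def summarize(estimates, std_errors, alpha0, crit=1.96):
    """RMSE, median bias, MAD, and 5%-level rejection frequency over replications."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    ok = ~np.isnan(est)                 # replications with a non-empty instrument set
    err = est[ok] - alpha0
    # An empty instrument set gives an unbounded interval, i.e. no rejection.
    rejects = np.zeros(est.size, dtype=bool)
    rejects[ok] = np.abs(err / se[ok]) > crit
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "med_bias": float(np.median(err)),
        "mad": float(np.mean(np.abs(err))),   # assumed: mean absolute deviation from alpha0
        "rp": float(np.mean(rejects)),
    }
```

Because the rejection frequency averages over all replications while the risk measures use only the non-empty ones, the two sets of columns in the tables are based on different effective sample sizes whenever selection returns no instruments.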

The major qualitative difference in the results for the LASSO- and √LASSO-based procedures
with n = 101 as compared with n = 500 is in the number of cases in which the variable
selection methods choose no instruments. With n = 101, we see that √LASSO tends to be
more conservative in instrument selection than LASSO and that, unsurprisingly, CV tends to
select more variables than the more conservative plug-in rule. For example, with F* = 10 and
Corr(e,v) = .3 in the exponential design, LASSO and √LASSO with the plug-in penalty select
no instruments in 122 and 195 cases, while LASSO and √LASSO using 10-fold cross-validation
select no instruments in only 27 and 30 cases. Outside of this, the same basic patterns for
RMSE, MAD, median bias, and rejection probabilities discussed in the n = 500 case continue
to hold.

Overall, the simulation results are favorable to the LASSO- and √LASSO-based IV methods.
The LASSO- and √LASSO-based estimators dominate the other estimators considered based
on RMSE or MAD and have relatively small finite-sample biases. The LASSO- and
√LASSO-based procedures also do a good job in producing tests with size close to the nominal level.
There is some evidence that the Fuller estimator using instruments selected by LASSO may
do better than the more conventional 2SLS estimator in terms of testing performance. In
the designs considered, it also seems that the simple plug-in rule may produce estimates that
behave slightly better than those obtained by using cross-validation to choose the LASSO and
√LASSO penalty levels. It may be interesting to explore these issues in more depth in future
research.



lasso methods for gaussian instrumental variables models 
6. Instrument Selection in Angrist and Krueger Data 
Next we apply post-LASSO in the Angrist and Krueger [6] model

$$y_{i1} = \theta_1 y_{i2} + w_i'\gamma + \epsilon_i, \quad E[\epsilon_i \mid w_i, x_i] = 0,$$
$$y_{i2} = z_i'\beta + w_i'\delta + v_i, \quad E[v_i \mid w_i, x_i] = 0,$$

where $y_{i1}$ is the log(wage) of individual i, $y_{i2}$ denotes education, $w_i$ denotes a vector of control
variables, and $x_i$ denotes a vector of instrumental variables that affect education but do not
directly affect the wage. The data were drawn from the 1980 U.S. Census and consist of 329,509
men born between 1930 and 1939. In this example, $w_i$ is a set of 510 variables: a constant,
9 year-of-birth dummies, 50 state-of-birth dummies, and 450 state-of-birth × year-of-birth
interactions. As instruments, we use three quarter-of-birth dummies and interactions of these
quarter-of-birth dummies with the set of state-of-birth and year-of-birth controls in $w_i$, giving
a total of 1530 potential instruments. [6] discusses the endogeneity of schooling in the wage
equation and provides an argument for the validity of $z_i$ as instruments based on compulsory
schooling laws and the shape of the life-cycle earnings profile. We refer the interested reader
to [6] for further details. The coefficient of interest is $\theta_1$, which summarizes the causal impact
of education on earnings.

There are two basic options that have been used in the literature: one uses just the three
basic quarter-of-birth dummies and the other uses 180 instruments corresponding to the three
quarter-of-birth dummies and their interactions with the 9 main effects for year-of-birth and
50 main effects for state-of-birth. It is commonly held that using the set of 180 instruments
results in 2SLS estimates of $\theta_1$ that have a substantial bias, while using just the three
quarter-of-birth dummies results in an estimator with smaller bias but a larger variance; see, e.g.,
[20]. Another approach uses the 180 instruments and the Fuller estimator [17] (FULL) with an
adjustment for the use of many instruments. Of course, the sparse methods for the first-stage
estimation explored in this paper offer another option that could be used in place of any of
the aforementioned approaches.
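The variable counts quoted above follow directly from the numbers of dummies. The short check below reproduces the arithmetic; the variable names are illustrative only.

```python
# Numbers of quarter-of-birth, year-of-birth, and state-of-birth dummies.
n_qob, n_yob, n_sob = 3, 9, 50

# Controls: constant + main effects + state-by-year interactions.
n_controls = 1 + n_yob + n_sob + n_yob * n_sob

# 180-instrument set: QOB dummies plus their interactions with the main effects.
n_instr_180 = n_qob * (1 + n_yob + n_sob)

# Full set: QOB dummies interacted with every control (including the constant).
n_instr_full = n_qob * n_controls
```

The three sets of 3, 180, and 1530 instruments are nested, which is why they serve as the natural a priori groupings against which the LASSO-selected instruments are compared.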

Table 5 presents estimates of the returns to schooling coefficient using 2SLS and FULL and
different sets of instruments. Given knowledge of the construction of the instruments, the first
three rows of the table correspond to the natural groupings of the instruments into the three
main quarter-of-birth effects, the three quarter-of-birth dummies and their interactions with
the 9 main effects for year-of-birth and 50 main effects for state-of-birth, and the full set of
1530 potential instruments. The remaining two rows give results based on using LASSO to
select instruments with penalty level given by the simple plug-in rule in Section 3 or by 10-fold
cross-validation. Using the plug-in rule, LASSO selects only the dummy for being born in the
fourth quarter, and with the cross-validated penalty level, LASSO selects 12 instruments which
include the dummy for being born in the third quarter, the dummy for being born in the fourth
quarter, and 10 interaction terms. The reported estimates are obtained using post-LASSO.

Footnote: We set the user-defined choice parameter in the Fuller estimator equal to one, which results in a higher-order
unbiased estimator.

The results in Table 5 are interesting and quite favorable to the idea of using LASSO to
do variable selection for instrumental variables. It is first worth noting that with 180 or 1530
instruments, there are modest differences between the 2SLS and FULL point estimates that
theory as well as evidence in [20] suggest are likely due to bias induced by overfitting the 2SLS
first stage, which may be large relative to precision. In the remaining cases, the 2SLS and
FULL estimates are all very close to each other, suggesting that this bias is likely not much
of a concern. This similarity between the two estimates is reassuring for the LASSO-based
estimates as it suggests that LASSO is working as it should in avoiding overfitting of the
first stage and thus keeping bias of the second-stage estimator relatively small.

For comparing standard errors, it is useful to remember that one can regard LASSO as a
way to select variables in a situation in which there is no a priori information about which of
the set of variables is important; i.e. LASSO does not use the knowledge that the three
quarter-of-birth dummies are the "main" instruments and so is selecting among 1530 a priori "equal"
instruments. Given this, it is again reassuring that LASSO with the more conservative
plug-in penalty selects the dummy for birth in the fourth quarter, which is the variable that most
cleanly satisfies Angrist and Krueger's [6] argument for the validity of the instrument set. With
this instrument, we estimate the returns-to-schooling to be .0862 with an estimated standard
error of .0254. The best comparison is FULL with 1530 instruments, which also does not use
any a priori information about the relevance of the instruments and estimates the
returns-to-schooling as .1019 with a much larger standard error of .0422. In the same information
paradigm, one can be less conservative than the plug-in penalty by using cross-validation to
choose the penalty level. In this case, only 12 instruments are chosen, producing a Fuller
point estimate (standard error) of .0997 (.0139) or 2SLS point estimate (standard error) of
.0982 (.0137). These standard errors are smaller than even the standard errors obtained using
information about the likely ordering of the instruments given by using 3 or 180 instruments,
where FULL has standard errors of .0200 and .0143 respectively. That is, LASSO finds just 12
instruments that contain nearly all information in the first stage and, by keeping the number
of instruments small, produces a 2SLS estimate that likely has relatively small bias. Overall,
these results demonstrate that LASSO instrument selection is both feasible and produces
sensible and what appear to be relatively high-quality estimates in this application.

Footnote: Due to the similarity of the performance of LASSO and √LASSO in the simulation, we focus only on LASSO
results in this example.
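To put the reported standard errors on a common footing, one can convert each point estimate and standard error quoted above into a conventional normal-approximation interval. The sketch below does only that arithmetic; the 1.96 critical value and the dictionary labels are assumptions for illustration, and the numbers are the ones quoted in the text.

```python
def wald_interval(estimate, std_error, crit=1.96):
    """Normal-approximation interval: estimate +/- crit * std_error."""
    half_width = crit * std_error
    return estimate - half_width, estimate + half_width


# (point estimate, standard error) pairs as quoted in the text.
reported = {
    "LASSO plug-in penalty": (0.0862, 0.0254),
    "FULL, 1530 instruments": (0.1019, 0.0422),
    "2SLS, LASSO with CV penalty": (0.0982, 0.0137),
    "FULL, LASSO with CV penalty": (0.0997, 0.0139),
}

intervals = {name: wald_interval(*pair) for name, pair in reported.items()}
```

Under this approximation, interval width scales with the standard error, so the FULL interval with 1530 instruments is roughly three times as wide as either interval based on the 12 cross-validated instruments.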
Appendix A. Notation.

We allow for the models to change with the sample size, i.e. we allow for array asymptotics. 
Thus, all parameters are implicitly indexed by the sample size n, but we omit the index to 
simplify notation. We use array asymptotics to better capture some finite-sample phenomena. 
We also use the following empirical process notation, 

$$\mathbb{E}_n[f] = \mathbb{E}_n[f(z_i)] = \sum_{i=1}^n f(z_i)/n,$$

and

$$\mathbb{G}_n(f) = \sum_{i=1}^n \big(f(z_i) - E[f(z_i)]\big)/\sqrt{n}.$$
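For readers who prefer code to notation, the two operators just defined can be sketched as below. The function names are hypothetical, and the population means $E[f(z_i)]$ needed for the empirical process must be supplied by the caller, since they are not observable from the sample.

```python
import numpy as np


def emp_mean(values):
    """E_n[f]: the sample average of f(z_i), i = 1, ..., n."""
    return float(np.mean(values))


def emp_process(values, pop_means):
    """G_n(f): sum of f(z_i) - E[f(z_i)], divided by sqrt(n)."""
    values = np.asarray(values, dtype=float)
    return float(np.sum(values - pop_means) / np.sqrt(values.size))
```

Here `pop_means` may be a scalar or an array of length n, which accommodates the array asymptotics described above where the distribution of each observation can depend on n.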

The $\ell_2$-norm is denoted by $\|\cdot\|$, and the $\ell_0$-norm, $\|\cdot\|_0$, denotes the number of non-zero
components of a vector. The empirical $L^2(\mathbb{P}_n)$ norm of a random variable $W_i$ is defined as

$$\|W_i\|_{2,n} := \sqrt{\mathbb{E}_n[W_i^2]}.$$

Given a vector $\delta \in \mathbb{R}^p$ and a set of indices $T \subset \{1, \ldots, p\}$, we denote by $\delta_T$ the vector in which
$\delta_{Tj} = \delta_j$ if $j \in T$, $\delta_{Tj} = 0$ if $j \notin T$. We use the notation $(a)_+ = \max\{a, 0\}$, $a \vee b = \max\{a, b\}$
and $a \wedge b = \min\{a, b\}$. We also use the notation $a \lesssim b$ to denote $a \le cb$ for some constant
$c > 0$ that does not depend on n; and $a \lesssim_P b$ to denote $a = O_P(b)$. For an event E, we
say that E wp → 1 when E occurs with probability approaching one as n grows. We say
$X_n =_d Y_n + o_P(1)$ to mean that $X_n$ has the same distribution as $Y_n$ up to a term $o_P(1)$ that
vanishes in probability. Such statements are needed to accommodate asymptotics for models
that change with n. When $Y_n$ is a fixed random vector that does not change with n, i.e.
$Y_n = Y$, this notation is equivalent to $X_n \to_d Y$.

Footnote: Note that it is simple to modify LASSO to use a priori information about the relevance of instruments by
changing the weighting on different coefficients in the penalty function. For example, if one uses the plug-in
penalty and simultaneously decreases the penalty loading on the three main quarter-of-birth instruments to reflect
beliefs that these are the most relevant instruments, one chooses only the three quarter-of-birth instruments.

Appendix B. Proof of Theorem 1 

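Step 0. Recall that $A_i = (D_i, w_i')'$ and $d_i = (y_{2i}, w_i')'$ for $i = 1, \ldots, n$. The condition that
$\mathbb{E}_n[A_i A_i'] = Q_n$ has eigenvalues bounded from above uniformly in n implies that

$$\mathbb{E}_n[D_i^2] + \mathbb{E}_n[\|w_i\|^2] = \mathbb{E}_n[\|A_i\|^2] = \operatorname{trace}(Q_n) \lesssim (1 + k_w)$$

is bounded from above uniformly in n.

Also, we have $\mathbb{E}_n[A_i\epsilon_i] \sim N(0, \sigma_\epsilon^2 Q_n/n)$ and $\mathbb{E}_n[A_i v_i] \sim N(0, \sigma_v^2 Q_n/n)$, so that

$$\|\mathbb{E}_n[d_i\epsilon_i]\| \le |\mathbb{E}_n[v_i\epsilon_i]| + \|\mathbb{E}_n[A_i\epsilon_i]\| \lesssim_P \sigma_v\sigma_\epsilon + \sigma_\epsilon\sqrt{(1 + k_w)/n},$$

$$\|\mathbb{E}_n[A_i v_i]\|^2 = |\mathbb{E}_n[D_i v_i]|^2 + \|\mathbb{E}_n[w_i v_i]\|^2 \lesssim_P \sigma_v^2(1 + k_w)/n,$$

$$\|d_i\|_{2,n} \le \|v_i\|_{2,n} + \|A_i\|_{2,n} \lesssim_P \sigma_v + \sqrt{1 + k_w},$$

where $\sigma_v$ and $k_w$ are bounded from above uniformly in n.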
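Step 1. We have that by $E[\epsilon_i \mid A_i] = 0$,

$$\sqrt{n}(\hat\alpha^* - \alpha_0) = \{\mathbb{E}_n[\hat A_i d_i']\}^{-1}\,\mathbb{G}_n[\hat A_i\epsilon_i] = \{\mathbb{E}_n[A_i d_i'] + o_P(1)\}^{-1}\big(\mathbb{G}_n[A_i\epsilon_i] + o_P(1)\big),$$

where by Steps 2 and 3 below:

$$\mathbb{E}_n[\hat A_i d_i'] = \mathbb{E}_n[A_i d_i'] + o_P(1), \qquad \text{(B.26)}$$
$$\mathbb{G}_n[\hat A_i\epsilon_i] = \mathbb{G}_n[A_i\epsilon_i] + o_P(1). \qquad \text{(B.27)}$$

Thus, since $\mathbb{E}_n[D_i(y_{2i} - D_i)] = o_P(1)$ and $\mathbb{E}_n[w_i(y_{2i} - D_i)] = o_P(1)$ by Step 0, note that
$\mathbb{E}_n[A_i d_i'] = \mathbb{E}_n[A_i A_i'] + o_P(1) = Q_n + o_P(1)$. Moreover, by the assumption on $\sigma_\epsilon$ and $Q_n$,
$\operatorname{Var}(\mathbb{G}_n[A_i\epsilon_i]) = \sigma_\epsilon^2 Q_n$ has eigenvalues bounded away from zero and bounded from above,
uniformly in n. Therefore,

$$\sqrt{n}(\hat\alpha^* - \alpha_0) = Q_n^{-1}\,\mathbb{G}_n[A_i\epsilon_i] + o_P(1),$$

and $Q_n^{-1}\,\mathbb{G}_n[A_i\epsilon_i]$ is a vector distributed as normal with mean zero and covariance $\sigma_\epsilon^2 Q_n^{-1}$.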



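Step 2. To show (B.26), note that $\hat A_i - A_i = (\hat D_i - D_i, 0')'$. Thus,

$$\|\mathbb{E}_n[(\hat A_i - A_i)d_i']\| = \|\mathbb{E}_n[(\hat D_i - D_i)d_i']\| \le \mathbb{E}_n[|\hat D_i - D_i|\,\|d_i\|] \le \|\hat D_i - D_i\|_{2,n}\cdot\|d_i\|_{2,n} \lesssim_P \|\hat D_i - D_i\|_{2,n} = o_P(1),$$

since $\|d_i\|_{2,n} \lesssim_P 1$ by Step 0, and by the rate assumption (2.14).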



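Step 3. To show (B.27), note that

$$\|\mathbb{G}_n[(\hat A_i - A_i)\epsilon_i]\| = |\mathbb{G}_n[(\hat D_i - D_i)\epsilon_i]| = |\mathbb{G}_n\{f_i'(\hat\beta - \beta_0)\epsilon_i\} + \mathbb{G}_n\{a_i\epsilon_i\}|$$
$$= \Big|\sum_{j=1}^p \mathbb{G}_n\{f_{ij}\epsilon_i\}(\hat\beta_j - \beta_{0j}) + \mathbb{G}_n\{a_i\epsilon_i\}\Big| \le \|\mathbb{G}_n(f_i\epsilon_i)\|_\infty\,\|\hat\beta - \beta_0\|_1 + |\mathbb{G}_n\{a_i\epsilon_i\}| \to_P 0$$

by condition (2.14), since $|\mathbb{G}_n\{a_i\epsilon_i\}| \lesssim_P c_s\sigma_\epsilon$ and $\sigma_\epsilon$ is bounded above uniformly in n.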

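Step 4. This step establishes consistency of the variance estimator in the homoscedastic
case. Since $\sigma_\epsilon^2$ and $Q_n = \mathbb{E}_n[A_i A_i']$ are bounded away from zero and from above uniformly in
n, it suffices to show $\hat\sigma_\epsilon^2 - \sigma_\epsilon^2 \to_P 0$ and $\mathbb{E}_n[\hat A_i\hat A_i'] - \mathbb{E}_n[A_i A_i'] \to_P 0$.

Indeed, $\hat\sigma_\epsilon^2 = \mathbb{E}_n[(\epsilon_i - d_i'(\hat\alpha^* - \alpha_0))^2] = \mathbb{E}_n[\epsilon_i^2] + 2\mathbb{E}_n[\epsilon_i d_i'(\alpha_0 - \hat\alpha^*)] + \mathbb{E}_n[(d_i'(\alpha_0 - \hat\alpha^*))^2]$, so
that $\mathbb{E}_n[\epsilon_i^2] - \sigma_\epsilon^2 \to_P 0$ by the Chebyshev inequality since $E[\epsilon_i^4]$ is bounded uniformly in n, and the
remaining terms converge to zero in probability since $\hat\alpha^* - \alpha_0 \to_P 0$ and $\|\mathbb{E}_n[d_i\epsilon_i]\| \lesssim_P 1$ by Step
0.

Next, note that

$$\|\mathbb{E}_n[\hat A_i\hat A_i'] - \mathbb{E}_n[A_i A_i']\| = \|\mathbb{E}_n[A_i(\hat A_i - A_i)' + (\hat A_i - A_i)A_i'] + \mathbb{E}_n[(\hat A_i - A_i)(\hat A_i - A_i)']\|,$$

which is bounded up to a constant by

$$\|\hat A_i - A_i\|_{2,n}\,\|A_i\|_{2,n} + \|\hat A_i - A_i\|_{2,n}^2 \to_P 0,$$

since $\|\hat A_i - A_i\|_{2,n}^2 = \|\hat D_i - D_i\|_{2,n}^2 = o_P(1)$ by (2.14), and $\|A_i\|_{2,n} \lesssim 1$ holding by Step 0. □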



Appendix C. Proof of Theorem 2 



Step 0. The step here is identical to Step 0 of the proof of Theorem 1, whereby we introduce
additional indices a and b on all variables; and n gets replaced by either $n_a$ or $n_b$.

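Step 1. We have that by $E[\epsilon_i^k \mid A_i^k] = 0$ for both k = a and k = b,

$$\sqrt{n_k}(\hat\alpha_k - \alpha_0) = \{\mathbb{E}_{n_k}[\hat A_i^k d_i^{k\prime}]\}^{-1}\,\mathbb{G}_{n_k}[\hat A_i^k\epsilon_i^k] = \{\mathbb{E}_{n_k}[A_i^k d_i^{k\prime}] + o_P(1)\}^{-1}\big(\mathbb{G}_{n_k}[A_i^k\epsilon_i^k] + o_P(1)\big),$$

where

$$\mathbb{E}_{n_k}[\hat A_i^k d_i^{k\prime}] = \mathbb{E}_{n_k}[A_i^k d_i^{k\prime}] + o_P(1), \qquad \text{(C.28)}$$
$$\mathbb{G}_{n_k}[\hat A_i^k\epsilon_i^k] = \mathbb{G}_{n_k}[A_i^k\epsilon_i^k] + o_P(1), \qquad \text{(C.29)}$$

where (C.28) follows similarly to Step 2 in the proof of Theorem 1 and condition (2.15). The
relation (C.29) follows from the Chebyshev inequality and

$$E\big[\|\mathbb{G}_{n_k}[(\hat A_i^k - A_i^k)\epsilon_i^k]\|^2 \mid k^c\big] \lesssim \sigma_\epsilon^2\,\|\hat A_i^k - A_i^k\|_{2,n_k}^2,$$

where we used that $(\hat A_i^k - A_i^k)$, $1 \le i \le n_k$, by construction are independent of $\epsilon_i^k$, $1 \le i \le n_k$, and
that $\|\hat A_i^k - A_i^k\|_{2,n_k} \le \|\hat D_i^k - D_i^k\|_{2,n_k} \to_P 0$, where $E[\cdot \mid k^c]$ denotes the expectation computed
conditional on the sample $k^c$, where $k^c = \{a, b\} \setminus k$.

By assumption, the eigenvalues of $\mathbb{E}_{n_k}[A_i^k A_i^{k\prime}]$ and $\sigma_\epsilon^2$ are bounded away from zero and from above,
and so we can conclude that

$$\sqrt{n_k}(\hat\alpha_k - \alpha_0) = \{\mathbb{E}_{n_k}[A_i^k A_i^{k\prime}]\}^{-1/2}\sigma_\epsilon Z_k + o_P(1),$$

where $Z_a$ and $Z_b$ are two independent $N(0, I)$ vectors; and also note that $\sqrt{n_k}(\hat\alpha_k - \alpha_0) = O_P(1)$, for k = a, b.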
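Step 2. Now putting together terms we get

$$\sqrt{n}(\hat\alpha_{ab} - \alpha_0) = \big((n_a/n)\mathbb{E}_{n_a}[\hat A_i^a\hat A_i^{a\prime}] + (n_b/n)\mathbb{E}_{n_b}[\hat A_i^b\hat A_i^{b\prime}]\big)^{-1}$$
$$\times \big((n_a/n)\mathbb{E}_{n_a}[\hat A_i^a\hat A_i^{a\prime}]\sqrt{n}(\hat\alpha_a - \alpha_0) + (n_b/n)\mathbb{E}_{n_b}[\hat A_i^b\hat A_i^{b\prime}]\sqrt{n}(\hat\alpha_b - \alpha_0)\big)$$
$$= \big((n_a/n)\mathbb{E}_{n_a}[A_i^a A_i^{a\prime}] + (n_b/n)\mathbb{E}_{n_b}[A_i^b A_i^{b\prime}]\big)^{-1}$$
$$\times \big((n_a/n)\mathbb{E}_{n_a}[A_i^a A_i^{a\prime}]\sqrt{n}(\hat\alpha_a - \alpha_0) + (n_b/n)\mathbb{E}_{n_b}[A_i^b A_i^{b\prime}]\sqrt{n}(\hat\alpha_b - \alpha_0)\big) + o_P(1)$$
$$= \{\mathbb{E}_n[A_i A_i']\}^{-1} \times \big(\tfrac{1}{2}\mathbb{E}_{n_a}[A_i^a A_i^{a\prime}]^{1/2}\sqrt{2}\,\sigma_\epsilon Z_a + \tfrac{1}{2}\mathbb{E}_{n_b}[A_i^b A_i^{b\prime}]^{1/2}\sqrt{2}\,\sigma_\epsilon Z_b\big) + o_P(1)$$
$$= \{\mathbb{E}_n[A_i A_i']\}^{-1} N\big(0, \tfrac{\sigma_\epsilon^2}{2}\mathbb{E}_{n_a}[A_i^a A_i^{a\prime}] + \tfrac{\sigma_\epsilon^2}{2}\mathbb{E}_{n_b}[A_i^b A_i^{b\prime}]\big) + o_P(1)$$
$$= \{\mathbb{E}_n[A_i A_i']\}^{-1} N\big(0, \sigma_\epsilon^2\mathbb{E}_n[A_i A_i']\big) + o_P(1)$$
$$= N\big(0, \sigma_\epsilon^2\{\mathbb{E}_n[A_i A_i']\}^{-1}\big) + o_P(1).$$

The conclusions now follow as in the proof of Theorem 1.

Step 3. This step is similar to Step 4 in the proof of Theorem 1. □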
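The first display in the combination step above expresses the split-sample estimator as a matrix-weighted average of the two subsample estimators. A minimal sketch of that combination is given below; the function name and argument names are hypothetical, with `M_a` and `M_b` standing for the subsample moment matrices $\mathbb{E}_{n_k}[\hat A_i^k\hat A_i^{k\prime}]$ and `alpha_a`, `alpha_b` for the subsample IV estimates.

```python
import numpy as np


def combine_split_sample(alpha_a, alpha_b, M_a, M_b, n_a, n_b):
    """Matrix-weighted average of the two subsample IV estimates."""
    n = n_a + n_b
    W_a = (n_a / n) * np.asarray(M_a, dtype=float)
    W_b = (n_b / n) * np.asarray(M_b, dtype=float)
    rhs = (W_a @ np.asarray(alpha_a, dtype=float)
           + W_b @ np.asarray(alpha_b, dtype=float))
    # Solve (W_a + W_b) x = rhs rather than forming an explicit inverse.
    return np.linalg.solve(W_a + W_b, rhs)
```

When the two subsamples have equal size and equal moment matrices, the combination reduces to the simple average of the two estimates, which matches the factor 1/2 appearing in the later lines of the display.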

Appendix D. Proof of Lemma 2 (Rates for LASSO and Post-LASSO) 

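Note that $\|\hat D_i - D_i\|_{2,n} \le \|f_i'(\hat\beta - \beta_0)\|_{2,n} + \|a(z_i)\|_{2,n} \le \|f_i'(\hat\beta - \beta_0)\|_{2,n} + c_s$. Let $\delta := \hat\beta - \beta_0$
and $c_0 = (c + 1)/(c - 1)$.

First consider the LASSO estimator. By optimality of the LASSO estimator and expanding
$\hat Q(\hat\beta) - \hat Q(\beta_0)$, if $\lambda \ge cn\|S\|_\infty$ we have

$$\|f_i'\delta\|_{2,n}^2 \le \frac{\lambda}{n}\big(\|\delta_T\|_1 - \|\delta_{T^c}\|_1\big) + \|S\|_\infty\|\delta\|_1 + 2c_s\|f_i'\delta\|_{2,n}$$
$$\le \Big(1 + \frac{1}{c}\Big)\frac{\lambda}{n}\|\delta_T\|_1 - \Big(1 - \frac{1}{c}\Big)\frac{\lambda}{n}\|\delta_{T^c}\|_1 + 2c_s\|f_i'\delta\|_{2,n}.$$

That yields that either $\|f_i'\delta\|_{2,n} \le 2c_s$ or that $\|\delta_{T^c}\|_1 \le c_0\|\delta_T\|_1$. As shown in Lemma 1 of [S]
(which is based on [U]), if $\lambda \ge cn\|S\|_\infty$ we have

$$\|f_i'\delta\|_{2,n} \le \Big(1 + \frac{1}{c}\Big)\frac{\lambda\sqrt{s}}{n\kappa_{c_0}} + 2c_s \lesssim \sqrt{\frac{s\log(p/\gamma)}{n}},$$

since $\kappa_{c_0}$ is bounded away from zero by condition RE as $n \to \infty$ and by the choice of penalty level
(3.17). Note that under the penalty choice (3.17) we have that $\lambda \ge cn\|S\|_\infty$ with probability
$1 - \gamma \to 1$ since $\gamma = o(1)$.

Under our conditions, we can invoke the sparsity bound for LASSO by Theorem 5 in [9]. We
have that

$$\|\hat\beta\|_0 \lesssim_P s.$$

Therefore, we have

$$\|\delta\|_2 \le \|f_i'\delta\|_{2,n}/\sqrt{\phi_{\min}(\|\delta\|_0)} \lesssim_P \|f_i'\delta\|_{2,n},$$

since for any fixed C > 0, $\phi_{\min}(Cs)$ is bounded away from zero as n grows by condition SE.

To establish the last result, if $\|\delta\|_0 \lesssim_P s$, it follows that $\|\delta\|_1 \le \sqrt{\|\delta\|_0}\,\|\delta\|_2 \lesssim_P \sqrt{s}\,\|\delta\|_2$.

The proof for the post-LASSO estimator follows from Theorem 6 in [9] and the sparsity
bound for LASSO in Theorem 5 in [9].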
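The last step of the LASSO argument uses the Cauchy-Schwarz bound $\|\delta\|_1 \le \sqrt{\|\delta\|_0}\,\|\delta\|_2$, which converts an $\ell_2$ rate into an $\ell_1$ rate for sparse vectors. The following small numerical check illustrates the inequality; the helper names are hypothetical and the tolerance `1e-12` is only there to absorb floating-point rounding.

```python
import numpy as np


def l0_norm(delta):
    """Number of non-zero components of delta."""
    return int(np.count_nonzero(delta))


def l1_bound_holds(delta):
    """Check ||delta||_1 <= sqrt(||delta||_0) * ||delta||_2."""
    delta = np.asarray(delta, dtype=float)
    lhs = np.sum(np.abs(delta))
    rhs = np.sqrt(l0_norm(delta)) * np.linalg.norm(delta)
    return bool(lhs <= rhs + 1e-12)
```

For example, the vector (3, 0, -4, 0) has $\ell_1$-norm 7, $\ell_0$-norm 2, and $\ell_2$-norm 5, and 7 is indeed below $\sqrt{2}\cdot 5 \approx 7.07$.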

Appendix E. Proof of Lemma 3 (Rates for √LASSO and Post-√LASSO)
The proof for the √LASSO and the Post-√LASSO estimator follows from [13] and [12].

Appendix F. Proof of Theorem 3 
First note that by a union bound and tail properties of the Gaussian random variables, see 

e.g. m, 

$$\|\mathbb{G}_n(f_i\epsilon_i)\|_\infty \lesssim_P \sigma_\epsilon\sqrt{\log p},$$

since $\mathbb{G}_n(f_{ij}\epsilon_i) \sim N(0, \sigma_\epsilon^2)$ and $\mathbb{E}_n[f_{ij}^2] = 1$ for $j = 1, \ldots, p$. Under the condition that $s^2\log^2 p = o(n)$,
the result follows by applying the rates in Lemma 1 (for LASSO and Post-LASSO) and Lemma
2 (for √LASSO and Post-√LASSO) to verify condition (2.14) in Theorem 1.



Appendix G. Proof of Theorem 4 

For every observation i in the subsample k we have

$$D_i = f_i'\beta_0 + a(x_i) = f_i^{k\prime}H_k\beta_0 + a(x_i), \quad \|H_k\beta_0\|_0 \le s,$$

so that $H_k\beta_0$ is the target vector for the renormalized subsample k.

Under our conditions, we can invoke the sparsity bound for LASSO by Theorem 5 in [9] and for
√LASSO by [12]. In either case, we have that for $\delta = \hat\beta^k - H_k\beta_0$, k = a, b,

$$\|\delta\|_0 \lesssim_P s.$$
Therefore, by condition SE, we have for $M = \mathbb{E}_{n_a}[f_i^a f_i^{a\prime}]$, $\mathbb{E}_{n_b}[f_i^b f_i^{b\prime}]$ or $\mathbb{E}_n[f_i f_i']$, that with
probability going to 1, for n large enough we have



< k' ^ 



)) ^ ct>. 



o) ^ k" < oo. 



Therefore, we have $\|H_{k^c}\beta_0 - \hat\beta^{k^c}\|_0 \lesssim_P s$ and



(G.31) 



r-l| 



where the last inequality holds with probability going to 1. Moreover, note that
$\|H_k H_{k^c}^{-1}\|_\infty \le \sqrt{\phi_{\max}(1)/\phi_{\min}(1)} \lesssim 1$ by Condition SE.

Then, under (1.5) and $s\log p = o(n)$, the result is an immediate consequence of Theorem
2 since (2.15) holds by (G.31) combined with Lemma 1 (for LASSO and Post-LASSO) and
Lemma 2 (for √LASSO and Post-√LASSO) that imply

$$\|f_i^{k\prime}(\hat\beta^k - H_k\beta_0)\|_{2,n_k} = o_P(1), \quad k = a, b.$$
Table 1: Simulation Results. Cut-Off Design. N = 101

                 Corr(e,v) = .3                              Corr(e,v) = .6
Estimator        RMSE       Med. Bias  MAD        rp(.05)    RMSE       Med. Bias  MAD        rp(.05)

F* = 10
2SLS(100)        0.046      0.043      0.043      0.718      0.085      0.084      0.084      1.000
FULL(100)        1.005      0.028      0.155      0.098      1.908      0.051      0.145      0.172
IV-LASSO         0.033      0.010      0.022      0.048      0.033      0.014      0.022      0.098
FULL-LASSO       0.032      0.010      0.022      0.042      0.032      0.015      0.022      0.090
IV-SQLASSO       0.033      0.010      0.022      0.050      0.034      0.016      0.023      0.100
FULL-SQLASSO     0.032      0.010      0.021      0.042      0.034      0.017      0.024      0.092
IV-LASSO-CV      0.031      0.013      0.021      0.074      0.036      0.021      0.026      0.186
FULL-LASSO-CV    0.031      0.011      0.022      0.058      0.033      0.017      0.024      0.118
IV-SQLASSO-CV    0.031      0.013      0.021      0.068      0.036      0.021      0.027      0.190
FULL-SQLASSO-CV  0.031      0.012      0.021      0.046      0.033      0.017      0.024      0.114

F* = 40
2SLS(100)        0.047      0.040      0.040      0.398      0.087      0.084      0.084      0.954
FULL(100)        1.895      0.010      0.130      0.062      1.558      0.029      0.132      0.114
IV-LASSO         0.031      0.002      0.021      0.056      0.032      0.010      0.024      0.068
FULL-LASSO       0.031      0.000      0.021      0.056      0.031      0.008      0.022      0.056
IV-SQLASSO       0.031      0.002      0.021      0.058      0.031      0.010      0.023      0.062
FULL-SQLASSO     0.032      0.001      0.021      0.056      0.031      0.008      0.022      0.056
IV-LASSO-CV      0.030      0.004      0.021      0.058      0.033      0.013      0.025      0.102
FULL-LASSO-CV    0.031      0.002      0.021      0.050      0.031      0.009      0.022      0.070
IV-SQLASSO-CV    0.030      0.003      0.021      0.058      0.033      0.013      0.024      0.108
FULL-SQLASSO-CV  0.031      0.002      0.021      0.048      0.031      0.008      0.022      0.074

F* = 160
2SLS(100)        0.041      0.029      0.030      0.188      0.053      0.057      0.057      0.548
FULL(100)        4.734      0.010      0.089      0.048      2.148      0.009      0.080      0.094
IV-LASSO         0.032      0.002      0.022      0.058      0.030      0.003      0.019      0.058
FULL-LASSO       0.032      0.001      0.022      0.056      0.030      0.002      0.020      0.054
IV-SQLASSO       0.032      0.002      0.022      0.058      0.030      0.003      0.019      0.054
FULL-SQLASSO     0.032      0.001      0.022      0.056      0.030      0.001      0.020      0.050
IV-LASSO-CV      0.031      0.003      0.023      0.058      0.031      0.005      0.020      0.064
FULL-LASSO-CV    0.032      0.001      0.022      0.058      0.030      0.002      0.020      0.054
IV-SQLASSO-CV    0.031      0.002      0.023      0.060      0.031      0.005      0.020      0.060
FULL-SQLASSO-CV  0.032      0.001      0.022      0.060      0.031      0.002      0.020      0.058

Note: Results are based on 500 simulation replications and 100 instruments. The first five first-stage coefficients were set equal to one and the
remaining 95 to zero in this design. Corr(e,v) is the correlation between first-stage and structural errors. F* measures the strength of the
instruments as outlined in the text. 2SLS(100) and FULL(100) are respectively the 2SLS and Fuller(1) estimator using all 100 potential
instruments. IV-LASSO and FULL-LASSO respectively correspond to 2SLS and Fuller(1) using the instruments selected by LASSO with the
data-driven penalty. IV-SQLASSO and FULL-SQLASSO respectively correspond to 2SLS and Fuller(1) using the instruments selected by √LASSO with
the data-driven penalty. IV-LASSO-CV, FULL-LASSO-CV, IV-SQLASSO-CV, and FULL-SQLASSO-CV are defined similarly but use 10-fold
cross-validation to select the penalty. We report root-mean-square-error (RMSE), median bias (Med. Bias), mean absolute deviation (MAD), and
rejection frequency for 5% level tests (rp(.05)). Many-instrument robust standard errors are computed for the Fuller(1) estimator to obtain
testing rejection frequencies. In the weak instrument design (F* = 10), the number of simulation replications in which LASSO and √LASSO with
the data-driven penalty and LASSO and √LASSO with penalty chosen by cross-validation selected no instruments are, for Corr(e,v) = .3 and .6
respectively, 39 and 39, 75 and 80, 9 and 11, and 10 and 12. √LASSO also selected no instruments in one replication with F* = 40 and Corr(e,v)
= .6. In these cases, RMSE, Med. Bias, and MAD use only the replications where LASSO selects a non-empty set of instruments, and we set the
confidence interval equal to (-∞, ∞) and thus fail to reject.
Table 2: Simulation Results. Cut-Off Design. N = 500

                 Corr(e,v) = .3                              Corr(e,v) = .6
Estimator        RMSE       Med. Bias  MAD        rp(.05)    RMSE       Med. Bias  MAD        rp(.05)

F* = 10
2SLS(100)        0.020      0.018      0.018      0.670      0.039      0.038      0.038      1.000
FULL(100)        0.035      -0.002     0.015      0.046      0.028      0.001      0.014      0.088
IV-LASSO         0.014      0.002      0.009      0.050      0.015      0.007      0.010      0.104
FULL-LASSO       0.014      0.003      0.009      0.046      0.014      0.007      0.010      0.094
IV-SQLASSO       0.014      0.002      0.009      0.052      0.014      0.008      0.010      0.104
FULL-SQLASSO     0.014      0.003      0.009      0.048      0.014      0.007      0.010      0.094
IV-LASSO-CV      0.013      0.003      0.008      0.070      0.016      0.010      0.012      0.154
FULL-LASSO-CV    0.014      0.003      0.009      0.052      0.015      0.008      0.011      0.114
IV-SQLASSO-CV    0.014      0.003      0.009      0.066      0.016      0.011      0.012      0.160
FULL-SQLASSO-CV  0.014      0.003      0.009      0.054      0.015      0.008      0.010      0.124

F* = 40
2SLS(100)        0.021      0.019      0.019      0.402      0.040      0.039      0.039      0.968
FULL(100)        0.017      -0.001     0.010      0.050      0.015      0.001      0.011      0.035
IV-LASSO         0.013      0.002      0.008      0.058      0.013      0.003      0.009      0.062
FULL-LASSO       0.013      0.001      0.009      0.054      0.013      0.001      0.009      0.048
IV-SQLASSO       0.013      0.001      0.008      0.058      0.013      0.003      0.009      0.062
FULL-SQLASSO     0.013      0.001      0.009      0.054      0.013      0.001      0.009      0.046
IV-LASSO-CV      0.013      0.002      0.009      0.062      0.013      0.004      0.009      0.074
FULL-LASSO-CV    0.013      0.001      0.009      0.058      0.013      0.002      0.009      0.046
IV-SQLASSO-CV    0.013      0.002      0.009      0.058      0.014      0.004      0.009      0.080
FULL-SQLASSO-CV  0.013      0.001      0.009      0.052      0.013      0.002      0.009      0.046

F* = 160
2SLS(100)        0.017      0.012      0.013      0.160      0.027      0.025      0.025      0.522
FULL(100)        0.014      0.000      0.010      0.058      0.014      -0.001     0.009      0.050
IV-LASSO         0.013      0.001      0.009      0.044      0.013      0.001      0.008      0.055
FULL-LASSO       0.013      0.000      0.009      0.044      0.013      0.000      0.008      0.052
IV-SQLASSO       0.013      0.001      0.009      0.044      0.013      0.001      0.008      0.056
FULL-SQLASSO     0.013      0.000      0.009      0.044      0.013      0.000      0.008      0.052
IV-LASSO-CV      0.013      0.001      0.009      0.046      0.013      0.001      0.008      0.056
FULL-LASSO-CV    0.013      0.000      0.010      0.042      0.013      0.000      0.008      0.052
IV-SQLASSO-CV    0.013      0.001      0.009      0.048      0.013      0.001      0.008      0.056
FULL-SQLASSO-CV  0.013      0.000      0.010      0.042      0.013      0.000      0.008      0.052

Note: Results are based on 500 simulation replications and 100 instruments. The first five first-stage coefficients were set equal to one and the
remaining 95 to zero in this design. Corr(e,v) is the correlation between first-stage and structural errors. F* measures the strength of the
instruments as outlined in the text. 2SLS(100) and FULL(100) are respectively the 2SLS and Fuller(1) estimator using all 100 potential
instruments. IV-LASSO and FULL-LASSO respectively correspond to 2SLS and Fuller(1) using the instruments selected by LASSO with the
data-driven penalty. IV-SQLASSO and FULL-SQLASSO respectively correspond to 2SLS and Fuller(1) using the instruments selected by √LASSO with
the data-driven penalty. IV-LASSO-CV, FULL-LASSO-CV, IV-SQLASSO-CV, and FULL-SQLASSO-CV are defined similarly but use 10-fold
cross-validation to select the penalty. We report root-mean-square-error (RMSE), median bias (Med. Bias), mean absolute deviation (MAD), and
rejection frequency for 5% level tests (rp(.05)). Many-instrument robust standard errors are computed for the Fuller(1) estimator to obtain
testing rejection frequencies. In the weak instrument design (F* = 10), the number of simulation replications in which LASSO and √LASSO with
the data-driven penalty and LASSO and √LASSO with penalty chosen by cross-validation selected no instruments are, for Corr(e,v) = .3 and .6
respectively, 8 and 5, 10 and 9, 1 and 1, and 1 and 1. In these cases, RMSE, Med. Bias, and MAD use only the replications where LASSO selects a
non-empty set of instruments, and we set the confidence interval equal to (-∞, ∞) and thus fail to reject.
Table 3: Simulation Results. Exponential Design. N = 101

                 Corr(e,v) = .3                              Corr(e,v) = .6
Estimator        RMSE       Med. Bias  MAD        rp(.05)    RMSE       Med. Bias  MAD        rp(.05)

F* = 10
2SLS(100)        0.067      0.063      0.063      0.750      0.130      0.128      0.128      1.000
FULL(100)        1.569      0.038      0.221      0.108      1.176      0.106      0.212      0.218
IV-LASSO         0.053      0.013      0.037      0.038      0.058      0.031      0.042      0.116
FULL-LASSO       0.052      0.014      0.037      0.035      0.057      0.033      0.043      0.106
IV-SQLASSO       0.050      0.014      0.033      0.032      0.056      0.032      0.040      0.096
FULL-SQLASSO     0.049      0.016      0.034      0.030      0.056      0.034      0.042      0.092
IV-LASSO-CV      0.051      0.022      0.038      0.086      0.065      0.048      0.053      0.274
FULL-LASSO-CV    0.051      0.021      0.038      0.060      0.060      0.042      0.047      0.186
IV-SQLASSO-CV    0.051      0.022      0.037      0.094      0.066      0.048      0.053      0.266
FULL-SQLASSO-CV  0.051      0.021      0.037      0.068      0.061      0.042      0.048      0.198

F* = 40
2SLS(100)        0.081      0.072      0.072      0.514      0.150      0.144      0.144      0.990
FULL(100)        1.653      0.068      0.223      0.122      2.826      0.029      0.231      0.160
IV-LASSO         0.048      0.011      0.033      0.044      0.050      0.011      0.033      0.052
FULL-LASSO       0.048      0.010      0.033      0.038      0.050      0.009      0.032      0.050
IV-SQLASSO       0.048      0.011      0.033      0.042      0.051      0.012      0.034      0.052
FULL-SQLASSO     0.048      0.010      0.033      0.038      0.050      0.011      0.033      0.048
IV-LASSO-CV      0.049      0.018      0.034      0.052      0.055      0.027      0.038      0.108
FULL-LASSO-CV    0.048      0.014      0.033      0.036      0.051      0.017      0.034      0.074
IV-SQLASSO-CV    0.049      0.018      0.035      0.048      0.056      0.027      0.039      0.116
FULL-SQLASSO-CV  0.049      0.014      0.033      0.040      0.051      0.019      0.036      0.074












F* = 150 








2SLS(100) 


0.070 


0.053 


0.055 


0.232 


0.115 


0.107 


0.107 


0.682 


FULL(IOO) 


2.570 


0.029 


0.170 


0.050 


1.250 


-0.004 


0.155 


0.102 


IV-LASSO 


0.051 


0.002 


0.035 


0.052 


0.051 


0.005 


0.034 


0.054 


FULL-LASSO 


0.051 


0.000 


0.034 


0.050 


0.051 


0.002 


0.033 


0.050 


IV-SQLASSO 


0.050 


0.004 


0.035 


0.060 


0.051 


0.006 


0.034 


0.058 


FULL-SQLASSO 


0.050 


0.002 


0.034 


0.058 


0.050 


0.002 


0.033 


0.050 


IV-LASSO-CV 


0.051 


0.009 


0.035 


0.068 


0.052 


0.015 


0.036 


0.072 


FULL-LASSO-CV 


0.051 


0.006 


0.034 


0.062 


0.050 


0.006 


0.034 


0.062 


IV-SQLASSO-CV 


0.051 


0.009 


0.035 


0.072 


0.052 


0.015 


0.035 


0.075 


FULL-SQLASSO-CV 


0.051 


0.006 


0.034 


0.064 


0.051 


0.007 


0.033 


0.052 



Note: Results are based on 500 simulation replications and 100 instruments. The first-stage coefficients were set equal to (.7)^(j-1), with
j = 1, ..., 100 denoting the associated instrument. Corr(e,v) is the correlation between first-stage and structural errors. F* measures the
strength of the instruments as outlined in the text. 2SLS(100) and FULL(100) are respectively the 2SLS and Fuller(1) estimators using all 100
potential instruments. IV-LASSO and FULL-LASSO respectively correspond to 2SLS and Fuller(1) using the instruments selected by LASSO with the
data-driven penalty. IV-SQLASSO and FULL-SQLASSO respectively correspond to 2SLS and Fuller(1) using the instruments selected by √LASSO with
the data-driven penalty. IV-LASSO-CV, FULL-LASSO-CV, IV-SQLASSO-CV, and FULL-SQLASSO-CV are defined similarly but use 10-fold cross-
validation to select the penalty. We report root-mean-square-error (RMSE), median bias (Med. Bias), mean absolute deviation (MAD), and
rejection frequency for 5% level tests (rp(.05)). Many-instrument robust standard errors are computed for the Fuller(1) estimator to obtain
testing rejection frequencies. In the weak instrument design (F* = 10), the numbers of simulation replications in which LASSO and √LASSO with
the data-driven penalty and LASSO and √LASSO with penalty chosen by cross-validation selected no instruments are, for Corr(e,v) = .3 and .6
respectively, 122 and 112, 195 and 193, 27 and 25, and 30 and 23. √LASSO also selected no instruments in one replication with F* = 40 and
Corr(e,v) = .3 and .6. In these cases, RMSE, Med. Bias, and MAD use only the replications where LASSO selects a non-empty set of instruments,
and we set the confidence interval equal to (-∞, ∞) and thus fail to reject.






Table 4: Simulation Results. Exponential Design. N = 500

                   Corr(e,v) = .3                              Corr(e,v) = .6
Estimator          RMSE       Med. Bias  MAD        rp(.05)    RMSE       Med. Bias  MAD        rp(.05)

F* = 10
2SLS(100)          0.031      0.029      0.029      0.774      0.058      0.057      0.057      1.000
FULL(100)          0.076      0.004      0.031      0.048      0.061      0.002      0.029      0.084
IV-LASSO           0.025      0.008      0.016      0.062      0.025      0.012      0.017      0.100
FULL-LASSO         0.024      0.008      0.016      0.055      0.025      0.014      0.017      0.100
IV-SQLASSO         0.025      0.008      0.016      0.064      0.025      0.013      0.017      0.100
FULL-SQLASSO       0.025      0.008      0.016      0.060      0.025      0.014      0.017      0.102
IV-LASSO-CV        0.024      0.011      0.017      0.090      0.027      0.017      0.020      0.202
FULL-LASSO-CV      0.024      0.010      0.017      0.072      0.026      0.014      0.018      0.160
IV-SQLASSO-CV      0.024      0.010      0.016      0.092      0.027      0.017      0.020      0.192
FULL-SQLASSO-CV    0.024      0.010      0.017      0.074      0.025      0.015      0.018      0.152

F* = 40
2SLS(100)          0.038      0.034      0.034      0.544      0.068      0.067      0.067      0.988
FULL(100)          0.031      0.003      0.020      0.062      0.030      0.000      0.022      0.054
IV-LASSO           0.023      0.004      0.016      0.068      0.023      0.008      0.017      0.070
FULL-LASSO         0.023      0.003      0.016      0.055      0.023      0.005      0.017      0.050
IV-SQLASSO         0.023      0.004      0.016      0.068      0.023      0.008      0.017      0.070
FULL-SQLASSO       0.023      0.003      0.016      0.056      0.023      0.006      0.017      0.062
IV-LASSO-CV        0.023      0.006      0.016      0.090      0.026      0.013      0.019      0.118
FULL-LASSO-CV      0.023      0.004      0.016      0.074      0.024      0.009      0.018      0.078
IV-SQLASSO-CV      0.024      0.006      0.015      0.090      0.026      0.013      0.019      0.120
FULL-SQLASSO-CV    0.023      0.004      0.015      0.076      0.024      0.009      0.018      0.080

F* = 160
2SLS(100)          0.031      0.024      0.024      0.232      0.052      0.049      0.049      0.710
FULL(100)          0.025      0.001      0.017      0.060      0.024      0.000      0.015      0.054
IV-LASSO           0.022      0.002      0.016      0.052      0.022      0.003      0.016      0.048
FULL-LASSO         0.022      0.001      0.016      0.046      0.023      0.001      0.015      0.050
IV-SQLASSO         0.022      0.003      0.015      0.052      0.022      0.003      0.016      0.054
FULL-SQLASSO       0.022      0.002      0.016      0.048      0.023      0.001      0.015      0.050
IV-LASSO-CV        0.023      0.004      0.015      0.062      0.023      0.006      0.016      0.066
FULL-LASSO-CV      0.023      0.003      0.015      0.054      0.023      0.002      0.016      0.052
IV-SQLASSO-CV      0.023      0.004      0.015      0.060      0.023      0.005      0.016      0.064
FULL-SQLASSO-CV    0.023      0.002      0.015      0.050      0.023      0.002      0.016      0.054



Note: Results are based on 500 simulation replications and 100 instruments. The first-stage coefficients were set equal to (.7)^(j-1), with
j = 1, ..., 100 denoting the associated instrument. Corr(e,v) is the correlation between first-stage and structural errors. F* measures the
strength of the instruments as outlined in the text. 2SLS(100) and FULL(100) are respectively the 2SLS and Fuller(1) estimators using all 100
potential instruments. IV-LASSO and FULL-LASSO respectively correspond to 2SLS and Fuller(1) using the instruments selected by LASSO with the
data-driven penalty. IV-SQLASSO and FULL-SQLASSO respectively correspond to 2SLS and Fuller(1) using the instruments selected by √LASSO with
the data-driven penalty. IV-LASSO-CV, FULL-LASSO-CV, IV-SQLASSO-CV, and FULL-SQLASSO-CV are defined similarly but use 10-fold cross-
validation to select the penalty. We report root-mean-square-error (RMSE), median bias (Med. Bias), mean absolute deviation (MAD), and
rejection frequency for 5% level tests (rp(.05)). Many-instrument robust standard errors are computed for the Fuller(1) estimator to obtain
testing rejection frequencies. In the weak instrument design (F* = 10), the numbers of simulation replications in which LASSO and √LASSO with
the data-driven penalty and LASSO and √LASSO with penalty chosen by cross-validation selected no instruments are, for Corr(e,v) = .3 and .6
respectively, 75 and 77, 86 and 85, 21 and 25, and 21 and 27. In these cases, RMSE, Med. Bias, and MAD use only the replications where LASSO
selects a non-empty set of instruments, and we set the confidence interval equal to (-∞, ∞) and thus fail to reject.
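
The reporting conventions in the table notes can be made concrete with a short sketch. It assumes that each replication yields a point estimate and a standard error, that a replication in which no instruments are selected is recorded as `None`, that deviations are measured from the true parameter value, and that the 5% level test uses the normal critical value 1.96. The function name and the data layout are hypothetical, not taken from the paper.

```python
import math
import statistics


def summarize_replications(estimates, std_errors, true_value, critical_value=1.96):
    """Summarize simulation replications following the conventions in the table notes.

    estimates[i] is None when no instruments were selected in replication i.
    """
    # RMSE, median bias, and MAD use only replications with a non-empty selection.
    errors = [b - true_value for b in estimates if b is not None]

    rejections = 0
    for b, se in zip(estimates, std_errors):
        # An empty selection gives the interval (-inf, inf), so the test never rejects.
        if b is not None and abs(b - true_value) / se > critical_value:
            rejections += 1

    return {
        "RMSE": math.sqrt(sum(d * d for d in errors) / len(errors)),
        "Med. Bias": statistics.median(errors),
        "MAD": sum(abs(d) for d in errors) / len(errors),
        # The rejection frequency is taken over all replications, including empty selections.
        "rp(.05)": rejections / len(estimates),
    }
```

Under these conventions, frequent empty selections pull rp(.05) down mechanically, which matters when reading the F* = 10 rows, where the notes report many such replications.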
Table 5: Estimates of the Return to Schooling in Angrist and Krueger Data

Number of Instruments   2SLS Estimate   2SLS Std. Error   Fuller Estimate   Fuller Std. Error

3                       0.1079          0.0196            0.1087            0.0200
180                     0.0928          0.0097            0.1061            0.0143
1530                    0.0712          0.0049            0.1019            0.0422

LASSO - Plug-In
1                       0.0862          0.0254

LASSO - 10-Fold Cross-Validation
12                      0.0982          0.0137            0.0997            0.0139



Note: This table reports estimates of the returns-to-schooling parameter in the Angrist-Krueger
1991 data for different sets of instruments. The columns 2SLS Estimate and 2SLS Std. Error give the
2SLS point estimate and associated estimated standard error, and the columns Fuller Estimate and
Fuller Std. Error give the Fuller point estimate and associated estimated standard error. We report
Post-LASSO results based on instruments selected using the plug-in penalty given in Section 3
(LASSO - Plug-In) and based on instruments selected using a penalty level chosen by 10-fold
cross-validation (LASSO - 10-Fold Cross-Validation). For the LASSO-based results, Number of
Instruments is the number of instruments selected by LASSO.



